The Computational Analysis Tool is extended to generate parallel or pipelined MATLIB definitions automatically. The result is the Automatic Parallel Mapper (APM), a C function that exploits the variable and loop analyses carried out by CAT and automatically generates parallel MATLIB definitions with complete data and function definitions and matching data transfer instructions. APM mapping decisions are based on the same parameters provided to CAT: the number of parallel processors, their computational characteristics (in the computational lookup table), and the main parameter of the communications architecture, the transmission rate in bits/sec. CAT analyses MATLIB programs and explores data and task parallelism; depending on its projections and decisions, APM generates parallel code (Figure 6.3).
Data parallelism can be exploited using the line-by-line computational analysis carried out by CAT. The pseudocode for the automatic generation of data parallel programs is as follows:
1 - Read the number of parallel processors, the communications speed, and the lookup table giving the unit execution costs of the MATLIB functions on the hardware involved.
2 - Calculate the sequential computational cost of the MATLIB operation for the host.
3 - Calculate the potential parallel computational cost for the current MATLIB operation on the available parallel hardware. This parallel cost includes the distribution costs, parallel processing costs and re-assembly costs.
4 - Compare the sequential computational cost with the parallel cost, and if the parallel cost is smaller, then generate parallel code.
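Steps 2 to 4 above can be illustrated with a minimal C sketch. The source does not give CAT's cost formulas, so everything here is an assumption made for illustration: the type and function names are hypothetical, costs are expressed in seconds, the work is assumed to divide evenly across identical processors, and the distribution and re-assembly costs are modelled simply as bits transferred divided by the transmission rate.

```c
#include <assert.h>

/* Hypothetical cost-model inputs; names are illustrative, not CAT's own. */
typedef struct {
    int    processors;      /* number of parallel processors */
    double bits_per_sec;    /* communications transmission rate */
    double host_unit_cost;  /* assumed seconds per elementary operation on the host */
    double node_unit_cost;  /* assumed seconds per elementary operation on a parallel processor */
} CostParams;

/* Step 2: sequential cost of one matrix operation on the host. */
double sequential_cost(const CostParams *p, double ops)
{
    return ops * p->host_unit_cost;
}

/* Step 3: distribution + parallel processing + re-assembly.
   Assumes the operations divide evenly over the processors. */
double parallel_cost(const CostParams *p, double ops,
                     double bits_out, double bits_back)
{
    assert(p->processors > 0 && p->bits_per_sec > 0.0);
    double distribute = bits_out / p->bits_per_sec;
    double compute    = (ops / p->processors) * p->node_unit_cost;
    double reassemble = bits_back / p->bits_per_sec;
    return distribute + compute + reassemble;
}

/* Step 4: returns 1 when parallel code should be generated, 0 otherwise. */
int should_parallelise(const CostParams *p, double ops,
                       double bits_out, double bits_back)
{
    return parallel_cost(p, ops, bits_out, bits_back) < sequential_cost(p, ops);
}
```

Under this model a large matrix product is dominated by the processing term and is worth distributing, whereas a small operation is dominated by the two transfer terms and stays on the host.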
The automatic generation of the parallel code involves the following:
• On the host side: the division of matrix data among the processors, the generation of put_ instructions to transmit partitioned data, and get_ instructions to collect the results.
• On the parallel clients side: the data definition statements, the matching get_ instructions to obtain the partitioned matrices, the matrix operation, and the transmission of results by put_, which is matched at the host end.
Figure 6.3. Automatic Parallel Mapper (diagram labels: number of VMs, computational lookup table, communications speed, MATLIB listing; Computational Analysis Tool with Loop Analysis, Variable Analysis, Sequential Costs and Parallel Costs; Automatic Parallel Mapper; outputs: Parallel MATLIB)
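The host-side division of matrix data can be sketched as follows. The source does not state how APM partitions a matrix, so a contiguous row-wise block split is assumed here, and the names RowBlock and row_block are hypothetical. The put_ and get_ calls are deliberately not shown, since their signatures are not given in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open row range [first, first + count) assigned to one processor. */
typedef struct {
    size_t first;
    size_t count;
} RowBlock;

/* Split `rows` matrix rows over `nproc` processors as evenly as possible.
   The first (rows % nproc) processors each receive one extra row. */
RowBlock row_block(size_t rows, size_t nproc, size_t rank)
{
    assert(nproc > 0 && rank < nproc);
    size_t base  = rows / nproc;
    size_t extra = rows % nproc;
    RowBlock b;
    b.count = base + (rank < extra ? 1 : 0);
    b.first = rank * base + (rank < extra ? rank : extra);
    return b;
}
```

In a generated program, the host would transmit one such block to each client and later receive the matching block of the result, so that the blocks together cover every row exactly once.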
Task parallelism is also possible using the CAT results. Repetitive sequential operations, such as the stages of network training, can be pipelined through an array of processors. Certain heuristic rules can be followed to prevent partitionings which would result in high interprocess communication. For example, loops represent high interconnectivity, and they should not be broken unless there are subloops which can be mapped onto separate processors. The pseudocode for instruction pipelining is as follows:
1 - Read the number of parallel processors, the communications speed and the lookup table.
2 - Calculate the line-by-line and total computational costs for sequential execution on the host.
3 - Divide the total cost by the number of processors (assuming they are identical processors). This gives the ideal cutting lines with a perfect load balance for a pipeline.
4 - Cut only forward data flow streams, and avoid cutting any backward data flow paths.
5 - Generate the host and slave sections of the code, with correct data definition and data exchange statements.
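Steps 2 to 4 of the pipelining pseudocode can be sketched in C as below. This is a simple greedy heuristic written for illustration, not the APM implementation: the function name pipeline_cuts and its arguments are hypothetical, and the flag array can_cut_after is assumed to have been filled in beforehand (for example from the loop analysis), with 0 marking any boundary that would cut a backward data flow path.

```c
#include <assert.h>
#include <stddef.h>

/* Choose pipeline cut points for `nstages` identical processors.
   cost[i]          : sequential cost of program line i
   can_cut_after[i] : 1 if only forward data flows cross the boundary
                      after line i, 0 otherwise (e.g. inside a loop)
   cuts[k]          : on return, index of the last line of stage k
   Returns the number of cuts placed (at most nstages - 1). */
size_t pipeline_cuts(const double *cost, const int *can_cut_after,
                     size_t nlines, size_t nstages, size_t *cuts)
{
    assert(nstages > 0);

    /* Step 2: total sequential cost. */
    double total = 0.0;
    for (size_t i = 0; i < nlines; i++)
        total += cost[i];

    /* Step 3: ideal cost per stage for a perfect load balance. */
    double ideal = total / (double)nstages;

    /* Step 4: cut at the first permitted boundary at or after each
       ideal cutting line. */
    size_t ncuts = 0;
    double running = 0.0;
    for (size_t i = 0; i + 1 < nlines && ncuts + 1 < nstages; i++) {
        running += cost[i];
        double target = ideal * (double)(ncuts + 1);
        if (running >= target && can_cut_after[i])
            cuts[ncuts++] = i;
    }
    return ncuts;
}
```

When a forbidden boundary coincides with an ideal cutting line, the cut slips to the next permitted boundary, so the resulting stages may be unbalanced. This is the trade-off implied by the rule that loops should not be broken.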
Once the parallel code listings are generated, they are compiled using a C compiler and are ready for execution on the parallel system.
6.7. Summary
In this chapter, the motivations and design considerations have first been outlined, and other mapping techniques have been reviewed. A number of model-specific and hardware-specific techniques have been assessed, and it was concluded that they cannot be generalised to all models and hardware. The execution, representation and mapping strategies have been established. The execution strategy is based on the exploitation of general-purpose parallel processors. The Virtual Machine philosophy promotes the use of vector-based hardware in a general-purpose neural computing framework. Matrix-based representations seem to capture neural network properties, and are suitable for general-purpose parallel hardware. The mapping strategy has been outlined as the computational optimisation of execution costs, as a general, flexible and potentially upgradable approach.
The main challenge in the computational optimisation is the parameterisation and costing of the executions. For this purpose, the computational costs for data and task parallelism are parameterised. A Computational Analysis Tool has been designed and developed for analysing and profiling matrix-based MATLIB representations. Once the computational costing is achieved, the automatic generation of the parallel code is a relatively simple task. An Automatic Parallel Mapper has been designed for this purpose, which exploits CAT results in the automatic code generation. Chapter 8 presents the CAT/APM projections and mappings of MATLIB neural network representations.
Chapter 7
Galatea Mapper and Scheduler
In this chapter, the Galatea Mapper and Scheduler are presented. As part of this thesis work, the Galatea Mapper is implemented, the Scheduler is specified, and, using the GPNC simulator, a number of parallel simulations have been carried out to test mapping strategies.
7.1. Introduction
This chapter presents the Galatea Mapper development as part of the Galatea GPNC simulator. Research presented in this chapter is the work of the author, carried out as part of the Galatea Project. The Galatea Mapper shares the same design considerations set out for the Mapper in chapter 6, and the mapping and scheduling strategies developed in this chapter are directly related to the research goals outlined for this thesis. Both the Mapper and the Galatea Mapper are designed to map matrix-based representations (MATLIB and VML) onto coarse-grain parallel architectures. This chapter presents the Galatea Mapper in the context of the Galatea GPNC to highlight the real-world constraints and the outcome.
The Galatea General Purpose Neural Computer is an advanced architecture, designed for the execution of a range of neural network models and applications. In chapter 3, a detailed description of the Galatea system is provided, including its general programming environment and multi-processor parallel architecture. In this chapter, the Galatea Mapper is presented in the context of the Galatea GPNC automatic code generation chain. As the mapper is responsible for the initial scheduling of the parallel execution, the scheduler development is also described.
As seen in chapter 3, the Galatea GPNC is a coarse-grained parallel architecture, which brings together generic hardware devices called Virtual Machines. Each VM consists of a Communications Unit, responsible for the control of the accelerator board, and an Execution Unit containing an accelerator board. The communications unit typically consists of a CPU and a large local RAM. Execution units are based on fast matrix operator generic boards, currently being developed by Siemens and Philips [41,42]. VMs communicate and interpret VML, a matrix-based, intermediate-level
language, consisting of scalar and matrix arithmetic operations. Consistent with the VM philosophy, the user interface is designed in a parallel distributed fashion with a number of independent processes called Graphical Virtual Machines (GVMs). All the independent modules of the GPNC (VMs, GVMs and the Scheduler) are interfaced through a message-passing communications protocol.
The Galatea Mapper’s duty in this environment is to partition and distribute the VML representation across a number of parallel VMs. The Mapper is responsible for the initial scheduling of the execution, and for generating the necessary data transfer instructions accordingly. The Galatea Mapper processes raw VML and generates parallel VML. This is essentially VML with a number of data exchange commands, which are interpreted by the Scheduler and the Communications Units of the VMs. After the initial mapping, the execution is controlled by the Scheduler, which serves as the communications interface.
In this chapter, the Mapper is first defined in the context of the Galatea GPNC automatic code generation process. Secondly, the Scheduler, the communications protocol and the run-time operations are explained. Then the implementation steps, the practical issues and the problems encountered during the implementation are discussed. Finally, Galatea mapping examples and parallel simulations on the Galatea GPNC are presented.