Galatea Mapper Implementation - Analysis, Representation and Mapping of Neural Networks onto P

To implement the GPNC simulator on SUN workstations, the first step is to enable parallel simulations on a SUN local area network (LAN). For this purpose Unix TCP/IP sockets have been used as the communications medium. TCP/IP is widely available on SUN and DEC workstation LANs, providing reliable, flow-controlled two-way transmission of data and messages. Initially, a prototype was developed to test the reliability of the communications medium. This prototype was used in chapter 4, in the simulation of parallel Hopfield nets. Later, this prototype was incorporated into the GPNC simulator, and has become the core of a comprehensive server, which undertakes scheduling tasks.
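As an illustration of the kind of two-way message exchange described above, the following minimal sketch frames each message with a length prefix on a stream socket. It is not the GPNC prototype: the function names (send_msg, recv_msg, demo_exchange) and the 4-byte length header are assumptions made for the example, and socket.socketpair() stands in for a TCP connection between two workstations.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    """Send one message: a 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; a stream socket may return fewer per recv call."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo_exchange() -> bytes:
    """Exchange a request and a reply over a connected pair of stream sockets."""
    a, b = socket.socketpair()  # stand-in for a TCP connection (assumption)
    try:
        send_msg(a, b"state-vector")
        request = recv_msg(b)
        send_msg(b, b"ack:" + request)
        return recv_msg(a)
    finally:
        a.close()
        b.close()
```

The length prefix is needed because a stream socket delivers an ordered byte stream without message boundaries, so the receiver must know how many bytes belong to each message.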

An evolutionary approach was followed in the implementation of the Mapper. It began with the manual mapping of parallel VML code onto a multi-VM environment, and its execution. Later, a semi-automatic mapper was developed, and an automatic mapper has been planned.

The first step in the implementation of the mapper is manual mapping or programming. After assessing the execution and the communications requirements for a given application and hardware characteristics, parallel VML code is written for all the parallel modules involved in the execution. The inter-VM data dependency and the load balance must be estimated, and data transfer instructions must be explicitly written by the programmer. The data movement commands are written in such a way that they display a handshake pattern between different modules. That is, a get_data statement in one VM is matched by a put_data in the other. If inter-VM data dependencies are not followed properly, the execution might come to a halt as a result of an unmatched data request statement. The time-sequence of the events must be planned in advance for a successful multi-processor parallel execution.
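The handshake requirement can be checked before execution with a small simulation. The sketch below is illustrative only: it assumes that put_data is buffered and never blocks, that get_data blocks until the matching put_data has been issued, and it represents each VM's transfer sequence as a list of (operation, peer, variable) tuples. The function name run_transfers and this representation are hypothetical and are not taken from VML.

```python
from collections import Counter


def run_transfers(programs):
    """Simulate the transfer statements of every VM.

    programs maps a VM name to its ordered list of
    (op, peer, variable) tuples, with op "put_data" or "get_data".
    Returns the sorted list of VMs that stall on an unmatched get_data
    (an empty list means every VM ran to completion).
    """
    pc = {vm: 0 for vm in programs}   # next statement index per VM
    in_flight = Counter()             # (sender, receiver, variable) -> count
    progressed = True
    while progressed:
        progressed = False
        for vm, prog in programs.items():
            while pc[vm] < len(prog):
                op, peer, var = prog[pc[vm]]
                if op == "put_data":
                    in_flight[(vm, peer, var)] += 1
                elif op == "get_data":
                    if in_flight[(peer, vm, var)] == 0:
                        break         # blocked until the peer issues its put
                    in_flight[(peer, vm, var)] -= 1
                else:
                    raise ValueError(f"unknown operation: {op}")
                pc[vm] += 1
                progressed = True
    return sorted(vm for vm in programs if pc[vm] < len(programs[vm]))


matched = {
    "vm0": [("put_data", "vm1", "w"), ("get_data", "vm1", "y")],
    "vm1": [("get_data", "vm0", "w"), ("put_data", "vm0", "y")],
}
unmatched = {
    "vm0": [("put_data", "vm1", "w"), ("get_data", "vm1", "y")],
    "vm1": [("get_data", "vm0", "w")],  # the put_data for y is missing
}
```

With these inputs, run_transfers(matched) reports no stalled VM, whereas run_transfers(unmatched) reports vm0, which waits indefinitely for y.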

The second step in the mapper implementation is semi-automatic mapping. This involves the mapper parsing the sequential (raw) VML definition and generating parallel VML code, following the user directives stored in the placement table file. This file provides the total number of parallel VMs in the GPNC configuration, and the instructions for where to map each VML rule in the raw VML listing. The task mapping is straightforward and user-driven. It relies on the users’ mapping instructions, based on their analysis and judgement of the application. The semi-automatic mapper is a C program which carries out the following tasks:

• Parses raw VML definition,

• Reads user mapping directives,

• Identifies and forms VML rule objects,

• Generates parallel VML listings.
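The user-driven grouping step can be sketched as follows. The placement table format used here (a "vms N" line followed by "rule_name vm_index" lines) is invented for the example, as are the function names; the actual file format is not reproduced in this chapter, and the mapper itself is a C program. Rule bodies are treated as opaque strings, so no VML syntax is assumed.

```python
def read_placement_table(text):
    """Parse a hypothetical placement table.

    Assumed format: one 'vms N' line giving the number of VMs, then
    one 'rule_name vm_index' line per rule. '#' starts a comment.
    """
    num_vms = None
    placement = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, value = line.split()
        if key == "vms":
            num_vms = int(value)
        else:
            placement[key] = int(value)
    if num_vms is None:
        raise ValueError("placement table does not state the number of VMs")
    for rule, vm in placement.items():
        if not 0 <= vm < num_vms:
            raise ValueError(f"rule {rule!r} is mapped to unknown VM {vm}")
    return num_vms, placement


def partition_rules(rules, num_vms, placement):
    """Group rule objects into one listing per VM, keeping source order.

    rules maps a rule name to its (opaque) body text.
    """
    listings = [[] for _ in range(num_vms)]
    for name, body in rules.items():
        if name not in placement:
            raise KeyError(f"no placement directive for rule {name!r}")
        listings[placement[name]].append((name, body))
    return listings


table = "vms 2\nforward 0  # hypothetical rule names\nupdate 1\n"
rules = {"forward": "<rule body>", "update": "<rule body>"}
num_vms, placement = read_placement_table(table)
listings = partition_rules(rules, num_vms, placement)
```

A real mapper would additionally insert the data transfer instructions between the listings; the sketch covers only the grouping of rules by VM.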

The final step in the mapper implementation involves the full automation of the parallel code generation process. The Automatic Mapper generates parallel VML code for all VMs in the GPNC configuration, based on the minimisation of the computational costs. Automatic mapping is done by calculating parallel processing and communications costs on all possible combinations of rule mappings on a coarse number of VMs. The partitioning that results in the minimum computational cost is selected, the rules are grouped, and ASCII listings of the parallel VML code are generated with correct data transfer instructions. The Galatea project chose the VML rules to be the lowest level objects to be mapped. To achieve that, the Galatea Mapper breaks the raw VML code into self-sufficient rule objects, each with a data definition part and a rule body. The automatic mapper consists of the following modules:

• VML Parser

• Variable and Rule Analysis

• Calculation of the Computational Costs

• Parallel VML code generation
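The exhaustive search over rule mappings can be illustrated with a small sketch. The cost model below is an assumption made for the example, not the Galatea cost function: the cost of a mapping is taken as the processing load of the most heavily loaded VM plus a charge for every data transfer that crosses a VM boundary. All names and figures are hypothetical.

```python
from itertools import product


def mapping_cost(assignment, proc_cost, transfers, comm_cost_per_unit):
    """Cost of one mapping under the assumed model.

    assignment: rule -> VM index
    proc_cost:  rule -> processing cost
    transfers:  (producer_rule, consumer_rule, data_size) tuples
    """
    load = {}
    for rule, vm in assignment.items():
        load[vm] = load.get(vm, 0.0) + proc_cost[rule]
    comm = sum(
        size * comm_cost_per_unit
        for src, dst, size in transfers
        if assignment[src] != assignment[dst]   # only inter-VM transfers cost
    )
    return max(load.values()) + comm


def best_mapping(rules, num_vms, proc_cost, transfers, comm_cost_per_unit):
    """Try every assignment of rules to VMs and keep the cheapest one."""
    best, best_cost = None, float("inf")
    for vms in product(range(num_vms), repeat=len(rules)):
        assignment = dict(zip(rules, vms))
        cost = mapping_cost(assignment, proc_cost, transfers, comm_cost_per_unit)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


rules = ["a", "b", "c"]                       # hypothetical rule names
proc_cost = {"a": 4.0, "b": 4.0, "c": 1.0}    # hypothetical costs
transfers = [("a", "c", 1)]                   # rule c consumes data from rule a
assignment, cost = best_mapping(rules, 2, proc_cost, transfers, 1.0)
```

The search visits num_vms ** len(rules) candidate mappings, which is why it is only practical for a small number of rules and VMs.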

Now, let us examine the modules of the automatic mapper, and the parameters involved in the automatic generation of parallel code for the Galatea GPNC.

VML Parser - The Mapper uses the same parser routines as the VML interpreter [25]. This approach reduced the workload, as it ensures that the mapper automatically follows the modifications in the VML syntax. As a result of the parsing, the mapper generates its internal model of the VML rule and data structures upon which it carries out the variable and rule analysis.

VML Variable and Rule Analysis - MIMD machines suffer from a data dependency problem. A classification of variable types is necessary to identify which variables have to be transmitted to other VMs, in a parallel execution. The Mapper carries out a variable analysis in which all variables are classified into 4 different variable types:

• Constant

• Local

• Read

• Write

The Constant type needs to be transmitted only once, at the beginning. Variables of this type stay unchanged throughout the execution. The Local type refers to temporary variables. The Read type needs to be received from the Scheduler (or other VMs). Finally, the Write type implies that a variable has been modified within that VM, and should be transmitted to the Scheduler or other VMs. Similarly, a rule analysis is carried out to reveal the rule dependency, and to build the rule hierarchy for all the subrules and the caller rules for each rule.
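A possible form of the variable classification is sketched below. The decision rules are assumptions chosen to match the four definitions above: a variable that no VM writes is treated as Constant, a variable accessed by a single VM as Local, and a shared variable as Write in the VMs that modify it and Read in the others. The function name and input layout are hypothetical; the Mapper's actual analysis works on its parsed model of the VML code.

```python
def classify_variables(accesses):
    """Classify each variable as seen from each VM.

    accesses maps a VM name to {"reads": set, "writes": set}.
    Returns {vm: {variable: label}}.
    """
    writers, users = {}, {}
    for vm, acc in accesses.items():
        for var in acc["writes"]:
            writers.setdefault(var, set()).add(vm)
        for var in acc["reads"] | acc["writes"]:
            users.setdefault(var, set()).add(vm)

    result = {}
    for vm, acc in accesses.items():
        labels = {}
        for var in sorted(acc["reads"] | acc["writes"]):
            if var not in writers:
                labels[var] = "Constant"   # never modified: send once
            elif users[var] == {vm}:
                labels[var] = "Local"      # no other VM touches it
            elif var in acc["writes"]:
                labels[var] = "Write"      # modified here, needed elsewhere
            else:
                labels[var] = "Read"       # modified elsewhere, needed here
        result[vm] = labels
    return result


accesses = {   # hypothetical variable names
    "vm0": {"reads": {"eta", "w", "tmp", "y"}, "writes": {"w", "tmp"}},
    "vm1": {"reads": {"eta", "w"}, "writes": {"y"}},
}
labels = classify_variables(accesses)
```

In this example eta is Constant on both VMs, tmp is Local to vm0, w is Write on vm0 and Read on vm1, and y is Write on vm1 and Read on vm0.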

Computational Costs - Performance of the parallel execution, which is defined as the computational cost of the execution, depends on two different parameters: Communications Costs and Processing Costs. The communications costs directly depend on the amount of data exchanged between various processors (VMs). Using the variable and rule analysis for a given mapping, data sizes are calculated for all the Read and Write type variables. The VML LOOP statement is decoded to obtain an approximate measure for the number of repetitions occurring for each command line. This measure is necessary to determine the volume of data which is transferred between the VMs and to establish the processing costs within loops. Real processing costs are hardware dependent. For a given architecture, the processing costs depend on the following parameters:

• Operation Type

• Data Type

• Data Size

• Placement Type

• Hardware Characteristics (speed and memory)
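The way loop repetitions scale both cost components can be shown with a short sketch. The per-element cost table, the operation and data type names, and the linear cost model are all hypothetical; real figures would have to come from the hardware groups, and the placement type and memory characteristics are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    op_type: str       # operation type, e.g. "mul" or "add" (hypothetical names)
    data_type: str     # data type, e.g. "fixed" or "float"
    data_size: int     # number of elements processed per execution
    repetitions: int   # approximate repetition count decoded from LOOP


# Hypothetical cost per element for each (operation type, data type) pair.
COST_PER_ELEMENT = {
    ("add", "fixed"): 1.0,
    ("add", "float"): 2.0,
    ("mul", "fixed"): 2.0,
    ("mul", "float"): 4.0,
}


def processing_cost(ops):
    """Sum of per-element cost x data size x loop repetitions."""
    return sum(
        COST_PER_ELEMENT[(op.op_type, op.data_type)] * op.data_size * op.repetitions
        for op in ops
    )


def communication_volume(transfers):
    """Bytes moved between VMs.

    transfers holds (num_elements, bytes_per_element, repetitions)
    for each Read or Write type variable.
    """
    return sum(n * size * reps for n, size, reps in transfers)


ops = [Op("mul", "fixed", 100, 10), Op("add", "fixed", 100, 10)]
total_processing = processing_cost(ops)
total_volume = communication_volume([(100, 2, 10)])
```

Because the repetition count multiplies both terms, an error in the loop estimate propagates directly into the processing cost and the transfer volume alike.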

After identifying these parameters, estimates of the processing costs for each operation type were requested from the Siemens and Philips hardware groups, with the intention of using the parameters as inputs for automatic mapping strategies. As a result, the following issues were highlighted:

1 - The optimal data type must be decided in VML code before or during the mapping. This is a very important issue for the hardware, and it is hardware-specific. For example, the Philips accelerator board achieves the highest performance on fixed point arithmetic operations.

2 - The optimal placement type must be defined for different data types in VML. The Siemens group [17] stressed the importance of the correct placement of data on the local memories of their VM, at the initial mapping stage. Four types of memory were reported on the Siemens hardware, namely: wmem, ymem, zmem and cmem. Siemens also listed the best memory placement for a number of data types as follows:

Placement Type          Memory Type

PLACEMENT_COMMS
PLACEMENT_FREE          wmem
PLACEMENT_STATE         ymem
PLACEMENT_WEIGHT        wmem
PLACEMENT_PATTERN       ymem
PLACEMENT_LUT           cmem
PLACEMENT_INPUT         ymem
PLACEMENT_OUTPUT        ymem
PLACEMENT_TEMPORARY     ymem

3 - The calculation times for matrices of various sizes can be made available after the manufacturing and tests. Only approximate information was given about the clock cycles for certain matrix operations which will be executed on the VM hardware.

Following these developments, the new version of VML (version 2.0) incorporated a PLACEMENT_TYPE field in the matrix declaration. As VML has no concept of pattern or weights, this field can ideally be decided at the N level and passed down to VML through the compiler. Mimetics notified that they could generate the PLACEMENT_STATE, PLACEMENT_WEIGHT and PLACEMENT_TEMPORARY placement types using the N to VML compiler. Mimetics also suggested that the INPUT and OUTPUT placement types can be determined in V, and hinted that new placement types could be necessary for optimal mapping. In addition to this, VML 2.0 also includes the data type information prefixed to all VML instructions. In this way, VML becomes similar to low level VML (LLVML) [42] and the generation of the low level commands on the VMs is made easier.

Mimetics automatically generated VML 2.0 code for a number of neural network models including the Gradient Descent Backpropagation model. The N to VML compiler is used in the automatic optimisation and generation of the raw VML code, completing the

compiler and the Mapper running together and generating parallel VML code for the VMs. The compiler generates the raw VML code, the symbol table, and the configuration table. The Mapper generates parallel VML and the Placement table to indicate the rule placement on the VMs. This data can also be used if a re-configuration is required, as in the case of Dynamic Mapping.
