Simulation Procedures and Results - A message-driven VLSI architecture for parallel object-orie

6.5.1. User Interface and Simulation Procedure

The simulation was used to exercise the primary realisations of the BROOM architecture. Two test routines were programmed to emulate the opposite regimes of local and distributed processing. The system behaviour was observed through meters showing statistics for each component. Messages could be exchanged with nodes to intervene in the workload profile.

The two programs used to drive the simulation were simple message generators. They generated messages at intervals ranging from two to eight cycles. This emulates a heavy communication load, because these intervals correspond to the minimum number of cycles needed to generate a message. The local program always sends messages to the same node, whereas the distributed program sends each message to a random destination. Instances of these programs can be loaded interactively at any memory position in any node to shape the desired workload.
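The two generators described above can be sketched as follows. This is a minimal illustration rather than the original simulator code: the function name, the `fixed_destination` parameter, the seed, and the uniform choice of interval and destination are all assumptions made for the example.

```python
import random


def generate_messages(mode, source, num_nodes, total_cycles,
                      fixed_destination=0, seed=0):
    """Yield (cycle, source, destination) for one generator instance.

    mode "local" always targets fixed_destination (a hypothetical parameter);
    mode "distributed" picks a uniformly random node for every message.
    """
    if mode not in ("local", "distributed"):
        raise ValueError("mode must be 'local' or 'distributed'")
    rng = random.Random(seed)
    cycle = 0
    while True:
        cycle += rng.randint(2, 8)  # two to eight cycles between messages
        if cycle > total_cycles:
            return
        if mode == "local":
            destination = fixed_destination
        else:
            destination = rng.randrange(num_nodes)
        yield cycle, source, destination


# One instance of each program on node 0 of a four-node system.
local = list(generate_messages("local", 0, 4, 100))
distributed = list(generate_messages("distributed", 0, 4, 100))
```

Loading several instances on different nodes, as the text describes, would correspond to merging the streams of several such generators.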

To observe the system behaviour, the user interface displayed a dynamic picture of the system, consisting of windows with information for each component. The user interface was based on the C curses library, which formats the screen into windows. There was one window for each node, plus global system windows: the network scheduler, the monitor, and the input and output slots. Each window was subdivided into fields presenting the information pertinent to the component.

The main observed measurements were:

Maximum Latency: the time elapsed from message creation to its acceptance.

Message Lifetime: the time elapsed from message creation to its destruction (average).

System Scheduler: display of the nodes currently scheduled.

Input Queue: display of the message headers in the queue of a node.

Queue Sizes: current sizes of the input and output queues of a node.

Units Activity: percentage of activity of each pipeline unit in a message step. A low value means that the unit is idle.
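As an illustration of how the timing measurements above could be derived from per-message timestamps, the following sketch computes the maximum latency, the average lifetime and a unit's activity percentage. The record layout, the function names and the sample timestamps are hypothetical and are not taken from the simulator.

```python
from dataclasses import dataclass


@dataclass
class MessageRecord:
    created: int    # cycle at which the message was created
    accepted: int   # cycle at which the destination accepted it
    destroyed: int  # cycle at which the message was destroyed


def maximum_latency(records):
    """Largest creation-to-acceptance time, in cycles."""
    return max(r.accepted - r.created for r in records)


def average_lifetime(records):
    """Mean creation-to-destruction time, in cycles."""
    return sum(r.destroyed - r.created for r in records) / len(records)


def unit_activity(busy_cycles, total_cycles):
    """Percentage of cycles in which a pipeline unit was busy."""
    return 100.0 * busy_cycles / total_cycles


# Illustrative timestamps only, not measured data.
records = [MessageRecord(0, 30, 78), MessageRecord(10, 44, 92)]
print(maximum_latency(records), average_lifetime(records))
```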

More information was available on request through reports from the monitor system, covering page faults, node activity and error conditions. The monitor also maintained an interface that accepted interrupts from the operator to change the simulation speed, send messages, or request reports on resource usage.

6.5.2. Simulation Results

Unfortunately, the simulation could not produce concrete statistical results because software and hardware support was discontinued. The machine used for the simulation was exchanged for another model, and the C++ compiler supplied with it proved to be completely incompatible with the original code. Only the results of the first tests were available, and all enhancements inferred from them were carried directly into the VLSI implementation without being verified in a new simulation run.

The results assessed from the simulation were used to guide the improvements to the architecture described in the previous chapters. Although the simulation went through only a few iterations, they were enough to indicate several modifications to the basic simulated architecture. Among the problems exposed were the register organisation, the relay protocol and the cache strategy. The modifications were mainly directed at the organisation of units, since the configuration of the elements (size of messages, queues, etc.) remained constant throughout the simulation.

At system level, the main observation concerned the behaviour of the network traffic. In the configuration tested, the network consisted of four nodes. The workload was increased interactively until the network was saturated. The average message latency and lifetime measured were around 32 and 80 machine cycles respectively.

The buffer transfers were found to be responsible for an unnecessary extra delay, especially in the message lifetime. The solution was to eliminate these intermediate registers and route message accesses directly to the message queue memories. On average, this eliminates 32 cycles from the message lifetime.
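The effect of this change on the reported figures can be worked out directly. The short calculation below assumes that the average lifetime of around 80 cycles was measured before the buffers were removed, so the projected value is an estimate and not a simulated result; the variable names are illustrative.

```python
measured_lifetime = 80      # cycles, average reported for the four-node test
buffer_transfer_cost = 32   # cycles removed on average by bypassing the buffers

# Assumption: the 80-cycle figure still includes the buffer transfers.
projected_lifetime = measured_lifetime - buffer_transfer_cost
relative_saving = buffer_transfer_cost / measured_lifetime

print(projected_lifetime)        # 48
print(f"{relative_saving:.0%}")  # 40%
```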

The queues were observed to hold around two messages. This is mainly due to the self-controlled nature of the message generators, which emulate "well-behaved" algorithms. A well-behaved algorithm does not generate a new instance of itself for each message it sends. In contrast, "greedy" algorithms can multiply instances exponentially and overflow the queues. The behaviour of the well-behaved algorithms indicated that the relay mechanism would be well tolerated as an exception-handling mechanism: since it would be invoked rarely, its overhead would be affordable.
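The contrast between the two kinds of algorithm can be illustrated with a deliberately simple model, in which every queued message is consumed in one step and replaced by `fanout` new messages. The fan-out values, the queue capacity of 16 and the function name are assumptions chosen for the example; the source does not state the queue capacity.

```python
def queue_occupancy(fanout, initial, steps, capacity):
    """Return (history, overflow_step) for a single queue.

    fanout is the number of messages produced per message consumed:
    1 models a well-behaved algorithm, 2 a greedy one that spawns a new
    instance of itself for each message sent. overflow_step is the first
    step at which demand exceeds capacity, or None if that never happens.
    """
    occupancy = initial
    history = []
    overflow_step = None
    for step in range(steps):
        occupancy *= fanout
        if occupancy > capacity and overflow_step is None:
            overflow_step = step  # the relay mechanism would be needed here
        history.append(min(occupancy, capacity))
    return history, overflow_step


well_behaved = queue_occupancy(fanout=1, initial=2, steps=6, capacity=16)
greedy = queue_occupancy(fanout=2, initial=2, steps=6, capacity=16)
```

In this model the well-behaved case stays at two messages indefinitely, while the greedy case doubles each step and exceeds the assumed capacity within a few steps.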

The indexes measuring the relative interdependence between the message pipeline units indicated a good balance in resource contention. They showed around 90 percent activity most of the time. However, the execution unit activity occasionally dropped very low when a page fault occurred. Page faults were induced by moving objects across the memory.

The memory bus bandwidth is insufficient to restore cache positions before very small granularity programs exhaust the input queue. In the BROOM architecture, a program is considered small when its code is just enough to receive and forward a single message. This can be done within six to twelve cycles. The memory bandwidth allows the update of a cache position within 20 or 28 cycles. Since BROOM proposes to support small granularity, the cache hit rate had to be optimised. This led to the development of a better caching strategy to make the best use of on-chip memory, to reduce message decoding times and to increase the chance of a cache hit.
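A rough estimate of the resulting idle time can be sketched as follows. The model assumes that, while one cache position is being restored, the execution unit keeps processing the other queued messages, that these all hit the cache, and that no new messages arrive; the function name and these simplifications are assumptions. The queue depth of two messages is the value observed earlier in this section.

```python
def idle_cycles(queued_messages, cycles_per_message, refill_cycles):
    """Cycles the execution unit waits once the queued work runs out."""
    work_available = queued_messages * cycles_per_message
    return max(0, refill_cycles - work_available)


# Two queued messages, 6 or 12 cycles each, cache update in 20 or 28 cycles.
for per_message in (6, 12):
    for refill in (20, 28):
        print(per_message, refill, idle_cycles(2, per_message, refill))
```

Under these assumptions the execution unit stalls in three of the four combinations, and only the slowest programs paired with the faster cache update keep it busy throughout.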

The associative memory list cache strategy mentioned in section 5.2.4 would reduce the number of page faults. However, when one does occur, the execution unit can still be left idle. The only envisaged remedy would be to increase the memory bus bandwidth with wider buses and external caches. Neither solution could be adopted, due to implementation restrictions.

The simulation was mainly an exercise in architecture design. The result was an evolution from a primitive implementation to a more adequate design, aware of VLSI implications. The simulation made it possible to observe the designed algorithms in motion, attesting to their feasibility and proper behaviour. The empirical results obtained served as guidance to redirect resources from areas where they were introducing overhead to areas where they can deliver full parallelism.

Chapter 7: VLSI Implementation

This chapter reports the implementation of the architecture as a microchip, using 2-micron CMOS technology. It describes the tools and libraries used for the design. The design approach takes into consideration the resources available for the project. A modular cell library was designed to implement the specialised operative parts. The strategies adopted for the control and operative parts are discussed for the resulting implementation.

In document A message-driven VLSI architecture for parallel object-oriented systems (Page 103-107)