Load Balancing by Function Distribution on the EM-4 Prototype

(1)

Load

Balancing

by

Function

Distribution

on

the

EM-4

Prototype

Y.

Kodama

S.

Sakai

Y.

Yamaguchi

Electrotechnical Laboratory Tsukuba, Ibaraki 305, JAPAN

Abstract

The EM-4 is a highly parallel daiaflow machine that will eventually have more than 1,000 processing elements (PEs). This paper presents load balancing methods by function distribution in the EM-4 and their evaluations on the EM-4 prototype, which con-sists of 80 PEs. The EM-4 can distribute function in-stances statically by using several different allocation functions, including a method exploiting the locality

of the network. Furthermore, the EM-4 can dynami-cally distribute function instances using MLPE packets which circulate through the PEs and detect the local-minimum load PE. These function distn’bution meth-ods are evaluated by executing a divide-and-conquer program and a game tree searching program, and ex-amining its dynamic characteristics on every PE.

1 Introduction

A highly parallel computer is considered to be promising for new computer systems, but there are many important problems to solve, such as synchro-nization, communication, and the extraction of par-allelism. Among those, load balancing[l, 2] is one of the most significant problems. Here, load balancing means the following procedures in order to achieve the highest performance in solving certain problems: to determine when, where, and how to distribute the activities among processing elements(PEs).

Three levels are usually considered in the activity-distribution procedures: 1) instruction level; 2) block or loop level; and 3) function level. Instruction-level activities and block-level activities are relatively sta-ble, i.e., an actual action of each instruction or each loop can be predicted to some degree, even if there are conditional branches and variable loop counters. On the other hand, function-level activities are not so sta-ble, especially those for solving the irregular programs such as Monte Carlo simulations, symbolic problems,

and AI problems, which will be, in the near future, much more import ant than they are now.

This paper examines the function distribution methods, i.e., function-level activity-distribution pro-cedures. Function distribution balances the load of PEs by means of allocating functions to the less loaded PE. There are two ways of function distribution: a static function distribution and a dynamic function distribution. In the former method the PE which ex-ecutes the function (callee PE) is selected before the program begins, but in the latter the callee PE is se-lected depending upon the status of PEs during the program execution.

This paper presents the methods used and the eval-uations of funct ion distribution on the EM-4, a highly parallel machine being developed at the Electrotech-nical Laboratory in Japan. Section 2 is an overview of the EM-4 and its prototype implementation, in-cluding its performance on several sample programs. Section 3 describes the load balancing methods of the EM-4. It illustrates the general methodology, the function call mechanism, and the static and dynamic function distribution methods on the EM-4 proto-type. Section 4 describes the evaluations of the load balancing by the static and dynamic function distri-bution methods. Section 5 concludes this paper and discusses our future plans.

2 The

EM-4

and

its

Prototype

2.1 Distinctive Feat ures of the EM-4

The objectives of the EM-4 project are as follows:

1.

2.

to develop an efficient general purpose parallel computer including more than 1,000 PE by im-proving dataflow architectures.

To design a fast and versatile interconnection net-work wit h minimal hardware.

(2)

To achieve the first objective, a dataflow model and a dataflow architecture were refined [11], and a new dataflow model was introduced, the “strongly con-nected arc model”. In this model a local block is exclusively executed in a single PE, without creating packets for intermediate results. This property en-ables register execution of adataflow graph and real-izes an advanced-control pipeline. The EM-4 adopts an integrated pipeline consisting of a packet-based cir-cular pipeline and a register-based advanced control pipeline [9]. Additionally, a direct matching scheme was introduced and implemented for simple and fast data matching [11].

Construction of an interconnection network with both a hardware cost of O(N) (A’ : number of PEs) and a distance between any two processors of O(logN) was completed in order to realize the second objective. The EM-4 uses a “circular omega” network topol-ogy to satisfy these conditions [10] as shown in Figure 1. The EM C-R contains both a data switching ele-ment and a data processing element which work con-currently and independently. In addition, the network in the EM-4 contains store-and-forward deadlock pre-vent ion facilities and automatic load-balancing facili-ties.

2.2

The

EM-4

Prototype

As the first step toward making a 1,000 PE com-puter, the EM-4 prototype [6] with 80 PEs was devel-oped and has been fully operational since April 1990. A single-chip processor

EMC-R

[9] was designed and implemented for this prototype. The

EMC-R

in-cludes all the function units of a PE including a net-work switch. It was fabricated with CMOS gate ar-ray technology and contains 45,788 gates. The EM-4 prototype was developed for showing the actual effec-tiveness of the architecture, evaluating several archi-tectural aspects, providing software-development en-vironments, and confirming the final EM-4 architec-ture. It is controlled by a 12.5 MHz synchronous clock and its peak performance is 1 GIPS.

Before we examine the load balancing methods, sev-eral significant features of the EM-4 should be dis-cussed. They are: execution performance of a PE, synchronization overhead, pipeline organization, and network performance.

(1) Local Execution Performance

The EMC-R is a RISC chip whose instructions are all executed in a single clock cycle, 80ns, and whose throughput is 12.5 MIPS. Local data are stored in a register file and operated on by a register-based advanced-control pipeline. In this case, there is no

PEIO,O] PEIO,l] PE[0,2]

a~a b ----’.,. /l__J--’.. _. ./t_/--’’... .!I__J- b ‘, j ... j- ,.,~J ... ; PEII,O] ‘q PE[2,1] ..”.. ~ PE[3,2] c ...% d f PE[3,0] PE[l,l] g -g h ... ... .... ... ... .+ h

circular path for grouping

...

circular path for load distribute

[GA,CA] GA=group number CA=member number

Figure 1: Circular Omega Network

need for dynamically synchronizing activities such as packet matching.

(2) Packet Execution Performance

Packets are executed on a circular pipeline whose throughput is 6.3 MIPS for a monadic operation and 4.2 MIPS for a dyadic operation.

Figure 2 shows the function units in the EMC-R, which realize the circular pipeline. The Switching Unit (SU) dispatches packets to network or the Input Packet Buffer Unit (IBU) independent of instruction execution. The IBU has a 32-word FIFO for a packet buffer. If the FIFO is full of packets, the overflow packets will be stored in the Memory Input packet Buffer (MIB) in an off-chip memory. The Fetch and Matching Unit (FMU) checks the matching memory and fetches the instruction if the matching data exists. The EXecution Unit (EXU) executes the instruction and the result is sent as a packet.

(3) Synchronization Overhead

In the circular pipeline, data matching is carried out in the FMU without associative mechanisms; this is called direct matching [11]. The direct mat thing scheme is a frame-based data mat thing scheme and realizes an efficient synchronization [12]. This scheme is similar to the Explicit Token Store (ETS) of the Monsoon Machine [7]. It has smaller size of instruc-tions and packets than ETS, but cannot re-use the matching memory in a function while ETS can. It takes a single clock and there are no additional clocks

(3)

Table 1: Evaluation by Sample Programs

Program EM-4 SPARC-330 VAX8800 CRAY/X-MP

ctime sec MIPS ct ime rat io et ime ratio ct ime ratio

FIB(23) 0.00453 223 0.054 11.92 0.134 29.58 0.0865 19.09 PRIME(65536) 0.506 508 16.88 33.36 37.95 75.00 10.60 20.95 PI(4000) 0.369 824 34.7 94.04 37.2 100.81 4.942 13.39 SUM(65535) 0.00042 780 0.0106 25.23 0.0595 141.67 0.0025 5.95 MTRX(80X80) 0.00888 815 0.7 78.82 2.0 225.23 0.0035 0.394 2.3

Performance

Evaluation

Network

I

-Ga

-@a

--@J

Network

L J

Figure 2: Function Units in the EMC-R

for anything like the hash miss-hitting.

On the other hand, in the register-based advanced-control pipeline, the sequence of instruction execution is first determined by a compiler, so there is no syn-chronization overhead.

(4) Network Performance

The network topology of the EM-4 is circu-lar omega[lO]. In the EM-4 prototype all the data switching facilities are implemented on EMC-R chips. The EM C-R includes the SU, which is a 3-by-3 buffered crossbar switch and handles packet switching independently of other function units.

Each packet word is transferred in a single clock cycle, 80 ns. The peak data transfer rate is 60.9 MB/s/port and 14.63 GB/s in a system. If no con-tention in the network occurs, the maximum delay of a packet is 880 ns and the average delay is 442 ns in the prototype, assuming the packet is destined ran-domly to any PE. If the communication is inside the PE group, then the maximum and average delay of a packet are 320 ns and 160 ns. The effect of packet transfer contention is stated in [10]

Several programs have been executed on the EM-4 prototype so as to evaluate preliminary performance. Table 1 shows the results. The programs were writ-ten in assembly language for the EM-4, in C for the SPARC330 and the VAX8800, and in FORTRAN for the CRAY/X-MP, and were compiled with the high-est level optimization. The results are represented in three forms: (1) CPU time; (2) EM-4 absolute MIPS; and (3) ratio of computation time of other computers to EM-4’s. Since the EM-4 prototype can count the number of clock cycles and the number of executed in-structions independently of and concurrently with the program execution, the EM-4’s performance is accu-rately measured.

In the table, FIB(23) is a program which calcu-lates the 23rd Fibonacci number. It is an appro-priate program for measuring function call overhead. PRIME(65536) is a stream-oriented program which finds all prime numbers less than 216. PI(4000) is a program which calculates the first 4000 figures of n. SUM(65535) is a program which adds up integers from 1 to 65,535. It divides the set of numbers into 80 PEs in the beginning and they are added on each PE locally. MTRX(80X80) is a program which calculates the matrix multiplication which size is 80 x 80.

The EM-4 prototype is 12 to 94 times faster than the SPARC330, 30 to 225 times faster than the VAX8800, and 0.4 to 21 times faster than the CRAY/X-MP. Even though the EM-4 programs are written in assembly language, these results indicate that EM-4’s architecture is one of the strongest con-tenders for use in near future supercomputers.

Even though the Fibonacci function causes many

parallel activities, the performance is weak in

com-parison with the other programs. This is one of the reasons why we take this program as a benchmark for the evaluation of the following load baiancing meth-ods.

(4)

3 Load

Balancing

Methods

in the

EM-4

3.1 The

Met

hodology

of Load

Balancing

The load balancing mechanism can be divided into the following three methods. The first is the static dis-tribution method in which the PE to which a new task is allocated is decided statically. The allocation strat-egy cannot be influenced by the load of the PEs in the system. The second method is purely dynamic distri-bution in which the PE is decided depending upon the load. The third method is a mixture of the first two and may prove to be the practical solution.

The static distribution method is effective when the program behavior is known in advance, since there is no overhead of finding a PE to which a new task is allocated dynamically. Moreover, in the case of static distribution, the locality of the PEs can be considered for selecting the allocation strategy. This means that the distance between the caller PE and the callee PE can be one of the important factors in the static dis-tribution methods for network structures which have a non-uniform access time, such as the circular omega, mesh or hypercube.

It is impossible to estimate precisely the whole amount of processing for a program or for fractions of programs. This is the reason why a dynamic distri-bution method is needed, and there are some tradeoff in dealing with the benefits of static distribution and dynamic distribution.

We think that a hybrid of the static and dynamic methods is the better solution for a practical distribu-tion method.

The EM-4 can handle load distribution on three lev-els: instruction level, block or loop level, and function level. The instruction-level activities are represented as a dataflow graph and block-level activities can be represented as a strongly connected block on the EM-4. Since the structure of the dataflow graph and the strongly connected block in a function is fixed, they can be sssigned statically to a PE or a group of PEs so as to reduce the communication cost.

Although the EM-4 can support fine-grain paral-lelism by dataflow synchronization, this paper will fo-cus on the function level coarse-grain parallelism. The reason is that function-level activities are not as sta-ble, especially those for solving the irregular programs such as Monte Carlo simulations, symbolic problems, and AI problems. These problems will, in the near future, be much more important than they are now. Irregular problems can be implemented easily on the

EM-4, since the EM-4 can combine both fine-grain and coarse-grain parallelism at any level. So a dat aflow graph that includes strongly connected blocks in a function can be allocated to a single PE.

In the load balancing method described in this pa-per, functions are distributed mainly statically to the PEs in the system, with some dynamic distribution by detecting a local minimum PE and assigning a task to it.

3.2 Function

Call

Mechanism

The function call mechanism of the EM-4 has to support the parallel evaluation on PEs. An invoked function is executed on a PE. The function invocation in the EM-4 is executed as follows.

1.

2.

3.

4.

Reserve an operand segment in a PE

Set the correspondence between the operand seg-ment and the template segment

Send arguments

Send return address

The first and the second step are performed by sending a message (special packet) to one of the PEs to which the caller is going to assign a segment. The PE is determined statically or dynamically according to the load distribution methods as explained later. It takes two clock cycles in both of the first and the second step. Since the caller PE receives the number of operand segment which the callee PE is going to use, arguments are passed to the callee PE by sending packets to the fixed address in the operand segment. So the third step and the fourth step take only one cycle each.

3.3

Static

Function

Distribution

In the static function distribution, the relation be-tween the PE that calls the function (caller PE) and the PE that executes the function (callee PE) is de-fined before the program runs. This concerns the is-sue of locality. The network topology, the distance between the caller PE and the callee PE, the num-ber of arguments and return values, and the size of the function body should be considered to define the relation.

The circular omega network hss two circular-path divisions in each node. In the EM-4, one is used for grouping the PEs and is the path intra group. The other circular-path division contains inter group

(5)

Table 2: Estimation of Load

paths. Using this grouping, each PE is identified by the pair of the group address (GA) and the member address (CA), which we represent [GA,CA]. In Figure 1, the thick line is an intra-group circular path and all the GA’s on the line are the same. The thin line is an inter-group circular path and all the GA’s on the line are different from each other.

In the EM-4, several methods for static function distribution are defined: the RANDOM method, the ROUND method, the absolute PE number method, and the relative PE number method. The RANDOM method selects a callee PE by randomly generated numbers. The ROUND method selects a callee PE by the round-robin rule from every PE. The absolute method selects a fixed PE as a callee PE. Any relation can be used in this method. For example, in the cir-cular omega network, there is a loop that links every PE at one stroke and the relation from a PE to the next PE on this loop (NEXTO) can be defined by the absolute method.

The EM-4 has another method, a relative PE num-ber method. The relative method selects a callee PE by using a relative PE address such as the next PE in the group (N EXT1), the next PE out of the group (NEXT2), or the combination of NEXT1 and NEXT2. The absolute method can freely define the relation be-tween a caller PE and a callee PE, but it is trouble-some to define the relations on every PE. The relative method remains free of defining the relation of combi-nation of NEXT1 and NEXT2, and it is easy to define the relation because one relation is applicable in every PE.

3.4 Dynamic Function Distribution

The dynamic function distribution is based on the dynamic load balancing for each function activity. It must select the callee PE according to the information about how busy the other processors are. It needs some explicit communication to get this information.

Dynamic load balancing techniques are divided into two classes on the EM-4. One is based on the global minima and the other is based on the local minima [5]. The former scheme can find a minimum load PE in a system at a certain time. It can be seen in the schemes using load-balancing switches like SIGMA-1 [4], PIE-64 [8], and Flagship [3]. This works by propagating the load information backward through the switching network and can be adapted for a wide range of par-allel machines which have a uniform network, such as a multistage interconnection network. The local min-ima scheme can be adapted for parallel machines with network structures which have a non-uniform access time, such as the circular omega, mesh or hypercube. There is, however, a potential problem that causes idle PEs to be ignored in favor of PEs with local minimal load searching. The EM-4 prototype uses this scheme for adding a dynamic load balancing feature to a static load balancing mechanism as described below.

3.4.1 The Factor of Load

In the EMC-R, the estimation of load is automati-cally generated by hardware according to the status of the PE as shown in Table 2. In this table, FIFO shows the number of packets in the on-chip input buffer, MIB shows the number of packets in the off-chip in-put buffer, ESEN shows the number of free operand segments for function invocations, and ESTN shows the number of free structured memories. In this esti-mation, the number of free operand segments and free structured memories are used only for the detection of critical conditions such as the exhaustion of resources. The EM-4 estimates the load mainly by the usage of the input packet buffer. This is because that the more as the number of allocated functions increases, the number of input packets increases.

(6)

CALLEE PE[02] c c d d e e f f g g

[LD,[GA,CA]] is the MLPE packet which shows that PE[GA,CA] has the minimum load LD

Figure 3: How to Detect the Minimum Load PE

3.4.2 Support for Load Balancing on Circular Omega Network

The EM-4 uses the special packet MLPE to detect the locally minimum load PE. When the MLPE packet is sent, the load and the address of the caller PE are set in the data field of the packet. The destination address of the MLPE packet is the caller PE itself. It goes out from the PE, and returns to the PE through the circular path in the network as shown in Figure 3. During the circular path, this packet is compared to the load of each PE, and if the load of the PE is less than the load of the packet the data field of the packet is changed to its load and its PE number. This comparison and change operation is done in a clock cycle for packet transfer. This method has no overhead except elapsed time for rounding the circular path.

This delay is small, but to reduce this overhead further, the EM-4 provides the prefetch mechanism. This is done as follows. ( I) The MLPE packet is sent ahead of a function call. (2)An operand segment num-ber (OSN), which is the function instance identifier, is requested to the minimum-load PE detected by the MLPE packet. (3) The OSN is stored in the special register SQ. When the function is called, the argu-ments are sent to the function instance indicated by the special register SQ, and the MLPE packet is sent for the next function call.

ods using the MLPE packets, attention must paid to the following conditions. First, this method detects the minimum-load PE not from all PEs but from those PEs that are in the circular path. Second, the prefetch data may not be in time for the next function call. Third, the data in the prefetched register does not show the current minimum-load PE. In spite of these conditions this method works well, as described in the following subsection.

4 Preliminary

Evaluation

of Load

Dis-tribution

For the preliminary evaluation of the load balanc-ing, the recursive Fibonacci program is executed and evaluated using the above mentioned static and dy-namic function distribution methods. The Fibonacci program has many function calls and since its execu-tion time depends on the selection of the callee PE, this can be used as a fundamental benchmark test for preliminary evaluation. For the evaluation of the dy-namic methods by the more practical program, the game tree searching program for checker game is exe-cuted and evaluated.

4.1 Evaluation of Static Function Distri-butions

The following six methods are evaluated for static function distribution by fibonacci program.

NEXT1-2 allocates one callee function to NEXT1 PE and the other to NEXT2 PE.

RANDOM allocates two functions to the PEs ran-domly selected from all the PEs.

ROUND allocates two functions to the PEs selected by the round-robin rule from all the PEs.

LRANDOM allocates two functions to the PEs ran-domly selected from the PEs on two circular paths.

LROUND allocates two functions to the PEs se-lected by the round-robin rule from the PEs on two circular paths.

NEXTO allocates one functions to NEXTO PE and the other to NEXTO’S NEXTO PE

(7)

Table 3: Results of Static Function Distributions

n

I

Execution i Active PE H

Methods Time (ins) rat io (70)

1 NEXT1-2 1.12 .54.6 u 2 RANDOM 1.98 44.0 3 ROUND 1.99 40.1 4 LRANDOM 1.49

_,

62.1 5 LROUND 1.37 62.7 6 NEXTO 8.44 1 8.6

Table 4: Result of NEXTO method using MLPE

static MLPE “ Execution Time(ms) 8.44 2.63

Active PE (%) 8.6 26.3

u ./ ,

a randomly generated number represents the absolute PE number, it may occur that many requests are con-centrated on a specific PE. To prevent this hot spot, the randomly generated number represents the rela-tive number from the caller PE. This relative scheme is used for the ROUND method, too. The LRAN-DOM and the LROUND method use the tables that represent the next PE in the intra and inter group.

The execution result of the fib(20) is shown in Table 3. In the table, the active PE ratio represents the average activity of all PEs while the program runs. This table shows the average result in ten trials, since the execution time varies depending upon the initial random seed or initial round pointer.

In this evacuation, NEXT1-2 method is the fastest, with LROUND and LRANDOM second and third. The Fibonacci program expands just like a tree struc-ture and the circular omega network can be fitted to this tree structure. Since the above methods use the network structure, they make the load well-balanced.

The NEXTO method is eight times slower than other methods. Since this method does not match well with the program structure of the Fibonacci, there may be imbalance of the function allocation to the PEs.

As for the active PE ratio, the LROUND method is better than the NEXT1-2 method. Nevertheless the NEXT1-2 method is faster because the number of operations in the LROUND method is larger. The NEXTI-2 method uses a constant value to select a

and rewriting the pointer. For large functions, the LROUND method may be better than the NEXTl-2 method because the larger the size of a function, the smaller the difference of overhead between each method.

Program performance depends on the static func-tion distribution method. It is important to select a method that adapts the program structure to the net-work topology in order to increase the performance. The evaluation results show that the static function distribution does not work well when the topology of the program fails to match the topology of the hard-ware structure, aa in the case of NEXTO. If the behav-ior of a program is known in advance, the allocation of the parallel program activities show good performance by using static function distribution. Unfortunately, this may not occur for general programs. This sug-gests that some kind of dynamic load balancing mech-anism is needed to compensate for such an unbalanced load.

4.2 Evaluation of Dynamic Function Dis-t ribuDis-t ions

Dynamic function distribution using MLPE meth-ods is evaluated by the Fibonacci program and the checker program on the EM-4 prototype.

4.2.1 Evaluation by Fibonacci Program

As mentioned above, in the dynamic function distri-bution using MLPE method the prefetching of the operand segment number (OSN) is carried out. If the OSN prefetched by the MLPE method exists the callee PE is selected by the OSN, but if not the callee PE is selected by the static distribution method. For this reason, the selection of the static distribution method affects the speed of execution by the MLPE method. The following two cases are examined in this paper.

case 1 Combines a dynamic distribution to the static method NEXTO.

case 2 Combines a dynamic distribution to the static method NEXT1-2.

The results of case 1 are shown in Table 4. Ta-ble 4 shows execution time and the percent of active PEs. The results show that the MLPE method is three times faster than the static method. The MLPE method is very useful when the load balance cannot be made equal only by the static method.

(8)

active PE

fib(24)

0:0 20.0 40.0 66.0 80.0

dOck.rxlo3

Figure 4: Active PEs using NEXT1-2

activates more PEs than the static method, and has less fluctuation in the number of active PEs. On aver-age, 79.6~o of all PEs are active in the MLPE method, with only 62.4% in the static method. This shows that the MLPE method makes the load on each PE more equal.

Moreover, to check the load balance, the equality of load in each PE is examined. In the EM-4, the load is estimated mainly by the number of input packets in the input buffer aa described in 3.4.1. Figure 5 shows the maximum, minimum, and average number of packets in the MIB when using the MLPE method and when using the static method. In this figure, the minimum number of packets in the static method can-not be seen because it is hidden by the x axis. That is, there are some PEs which do not have packets in the MIB in every clock cycle. In comparison to this, the minimum number of packets in the MLPE method is about half the maximum, except at the end of the pro-gram. This shows that the MLPE method haa more equality of load.

Figure 6 shows the usage of MLPE packets. This figure shows that a large number of MLPE packets are used at the beginning of the program execution and only a small number of MLPE packets are used at the end. According to the table in the Figure 6 only 5% of all the function invocations use the MLPE methods.

fib(24)

words x103 a=cL5H= 1.8 1.6 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0.0 60 20.0 4ri:o 60.0 80;0 ~ clocksx 10

Figure 5: Usage of MIB

until other packets are operated on when the load of the system becomes heavy. The long turn-around time of the MLPE packets means that each packet is used less. To increase the usage of the MLPE packets, a few improvements must be considered. One possibility y is that MLPE packets will be operated in high priority. Another one is to use plural MLPEs for a PE. These improvements are being studied for the next version of the system.

Table 5 and Figure 7 shows what kind of function units are active in the MLPE method. In this figure, the thin line shows the number of PEs in which the execution unit (EXU) or the fetch and matching unit (FMU) is active. The gap between the dotted line and the thin line shows the number of PEs in which the input buffer unit(IBU) only is active, and the gap between the thick line and the dotted line shows the number of PEs waiting for sending packets due to the network conflict. On average, the percent of PEs in which only the IBU is active in the MLPE method is about twice that in the static method, aa shown in Table 5.

The state in which the IBU only is active indicates that the input packet buffer on a chip (FIFO) is in a state of overflow or underflow. lVhen the MLPE method is used, the load is well balanced and packets are distributed to each PE uniformly. This uniformity

(9)

MLPE

fib(24)

400 350 r 1 function call 46367 3m (using MLPE 2216) 250 L / 200 150 100 50 \ o 0.0

20.0

40.0

60.0

80.0 .

clocks%

103

Figure 6: Usage of MLPE packet

Table 5: Ratio of Active Function Units

Active Units static MLPE

PE(%) 61.4 79.6

EXU or FMU(%) 44.1 45.3

IBU only(%) 17.3 34.3

~ Network co~flict(%) I 9.2 I 13.4 n

flow and some PEs has no packet, and as the result the total overhead increases. This is because the size of a function of the fib is small, the number of the function is many, and the packet frequency is high. The result using MLPE is a special case and another program such as bellow has about 10 Yo overhead of the IBU.

4.2.2 Evaluation by Checker Game Program

To evaluate the dynamic function distribution by a more practical program than fibonacci, the game tree searching program for a checker game is evaluated. The program selects the best move according to a min-max search algorithm for the game tree. When a possible move is found, the function which makes the move and finds a next move is distributed in another PE.

active PE

fib(24)

O.i)

20:0

40’.0

60’.0

80:0

d0ckSxlo3

Figure 7: Active Function Units in the MLPE method

active PEs using both distribution strategys is shown in Figure 8. These programs searched the game tree from the initial situation of checker to a depth of 6 in parallel. In this figure, the MLPE method reduces the total execution time about 10 Yo and increases the ac-tive PEs 10 ?40compared with the static method. This shows the effectiveness of MLPE method not only for a simple Fibonacci program but for a practical pro-gram.

5 Conclusion

The function distribution methods and their evalu-ations with sample programs on the EM-4 have been presented in this paper.

In the static function distribution, the most ade-quate method should be adapted to both the program structure and network topology, because the program performance seriously depends on these adaptations. In the case of Fibonacci program, which is expanded to the binary tree structure, the tree-like local alloca-tion in the circular path in the circular omega network is effective. We are investigating other types of pro-gram structures such as more general tree structures, stream structures and so on.

(10)

active PE

checker

References

0!0 0!5 1!0 1!5 J.o ;.5

clocksxlo6

Figure 8: Active PEs on the checker program

the amount of packets in an input buffer on each PE. The evaluation results show that the MLPE method distribution makes the load on each PE more equal to-tally. Moreover, the MLPE method is a good backup for the static method, when the static method fails to balance the load.

As to future work on the prototype, many other application programs should be checked and effective function distribution methods to each program should be searched. Moreover, the methodologies should be embedded into the high-level language compiler and the method should be selected automatically.

As for work toward the next version of the EM-4, the following must be closely examined: (1) priority control for the MLPE packets; (2) increase of capac-ity of the special register where prefetched operand-segment numbers are buffered; and (3) improvement of IBU to reduce overhead.

Acknowledgements

We wish to thank Dr. Toshitsugu Yuba, Director of the Computer Science Division, Mr. Toshio Shimada, Chief of the Computer Architecture Section for

sup-[1] [2] [3] [4] [5] [6] [7] [8] [9]

Coffman,E.G., Jr.(Ed): Computer and Job-Shop Scheduling Theory, John Wiley & Sons, N.Y. (1976)

Wang,Y. and Morris,R.J .T.: Load Sharing in Dis-tributed Systems. IEEE Trans. on Computers, VO1.C-34, No.3, pp.204217 (1985).

Watson, I., Woods, V., Watson, P., Banach ,R., GreenbergJM. and Sargeant ,J.: Flagship: A paral-lel architecture for declarative programming, Proc. of ISCA88, pp.124-130 (1988).

Hiraki,K., Sekiguchi,S. and Shimada,S.: Load scheduling schemes using inter-PE network, Sys-tems and Computers in Japan, VO1.18, No. 1 (1987).

Keller, R. M., Lindstrom,G. and Patil,S.: A Loosely-Coupled Applicative Multiprocessing Sys-tem, NCC, VO1.48, pp.612-622 (1979).

Kodama,K., Sakai,S., Yamaguchi,Y.: A Prototype of a Highly Parallel Dataflow Machine EM-4 and its Preliminary Evaluation, Proc. of InfoJapan90, pp.291-298 (1990).

Papadopoulos,G.M. and Culler,D.E.: Monsoon: an Explicit Token-Store Architecture, Proc. 17th Annual SYmp. on Comput. Arch., pp. 82-91 (1990).

Sakai, S.: Interconnection Networks in a Highly Parallel MIMD Computer, Doctoral Dissertation, University of Tokyo (1985).

Sakai,S., Yamaguchi,Y., Hiraki,K., Kodama,Y. and Yuba,T.: An Architecture of a Dataflow Sin-gle Chip Processor, Proc. of ISCA89, pp.1155-1 160 (1989).

[10] Sakai,S., Kodama,K., Yamaguchi,Y.: Design and Implementation of a Versatile Interconnection Net-work in the EM-4, to appear in ICPP91.

[11] Yamaguchi,Y., Sakai,S., Hiraki,K., Kodama,Y. and Yuba,T.: An Architectural Design of a Highly Parallel Dataflow Machine, Proc. of IFIP 89, pp.1155-1160 (1989).

[12] Yamaguchi,Y., Sakai,S. and Kodama,Y.: Syn-chronization mechanisms of a highly parallel dataflow mchine EM-4, IEICE Trans, VO1.E74,