A Multi Objective Genetic Algorithm for Program Partitioning and Data Distribution Using TVRG

(1)

A Multi–Objective Genetic Algorithm for Program

Partitioning and Data Distribution Using TVRG

Masami Takata ∗_{, Tomomi Yamaguchi} †_{, Chiemi Watanabe} †_, Yoshimasa Nakamura∗ ‡_{, and Kazuki Joe} †

∗ _{PRESTO, Japan Science and Technology Agency} Kyoto University

Kyoto-city, Japan † _{Graduate School of Humanity and Science}

Nara Women’s University Nara-city, Japan

‡ _{Graduate School of Informatics} Kyoto University

Kyoto-city, Japan

AbstractWe propose an algorithm that per-forms data distribution and parallelization simultaneously. The objectives of the simul-taneous algorithm are to reduce the length of critical path and the total memory size. Re-gardless to say, memory usage for each pro-cessor must be balanced. To obtain an opti-mal solution, we first adopted a branch and bound method. Since the branch and bound method often fails in the case of a large task graph, we adopt a multi–objective genetic al-gorithm, that provides a near optimal solu-tion. For effective simultaneous partition-ings, we employ some edge sorting and or-dering methods. The effectiveness of our si-multaneous partitioning algorithms is shown by experimental results.

Keywords: Data distribution, Program partition-ing, Parallel program, TVRG, Genetic algorithm, Multi–objective

1 Introduction

To analyze physical and natural phenomena, numerical simulations are used in general. Large–scale computations and memory capac-ities are inevitable for the numerical calcula-tions. Even when computer hardware and sys-tem software are considerably improved, the numerical calculation requires so huge compu-tational resources that computer systems of uni–processor and uni–module of memory can not reasonably perform the huge computation. Parallel and distributed processing is a natu-ral method to overcome the resource problem. However, it is very difficult for typical numer-ical simulation programmers to perform

ade-quate transformation from a given sequential program to a parallel form, because it requires special knowledge of parallelization. Thus, an automatic parallelizing compiler is required to compensate for the lack of their programming skill.

We are working for the development of an automatic parallelizing compiler PROMIS-NWU [10] for distributed memory environ-ments. It is an extension of PROMIS [8] de-veloped at UIUC, which is also an automatic parallelizing compiler for shared memory envi-ronments.

To design a parallelizing compiler for dis-tributed memory environments, data partition-ing and its assignment should be optimized as well as program partitioning of parallelization

1_{. Both program partitioning and data}

dis-tribution are known as an NP-complete prob-lem [1]. When just the parallelization is per-formed, some processors may have to access the remote data which other processor owns. In this case, communication overhead arises, and the elapsed time often and terribly increases. On the other hand, when just data partition-ing and its assignment are considered, paral-lelization may not be carried out at all. There-fore, both problems should be solved simulta-neously.

TVRG (Task and Variable Representation Graph) [4], that represents program flows and data dependence of given programs, is a UIR (Universal Intermediate Representations) for PROMIS-NWU. To extract essential

informa-1_{In this paper, we use program partitioning as the} same meaning of parallelization

(2)

tion from TVRG and transform it to an acyclic weighted directional task graph, the paral-lelization and distribution problem can be re-garded as a graph partitioning problem. The graph partitioning requires huge memory ca-pacity to be solved besides long calculation time, because a combinatorial explosion oc-curs in the case of a large task graph parti-tioning. In our previous works for program partitioning [5] [6], large task graphs can not be partitioned by branch and bound based ap-proaches. Therefore, we assume that simi-lar approaches for simultaneous partitioning of parallelization and data distribution to large task graphs also require too much computa-tional resources. We here propose a GA (Ge-netic Algorithm) based simultaneous partition-ing algorithm which provides a near optimal solution to given large task graphs. Note that we have proposed a GA based program parti-tioning algorithm [7].

For an appropriate GA based algorithm of simultaneous program partitioning and data distribution, we use a multi–objective GA.

In section 2, we explain transformation def-initions from the TVRG information to a task graph, and describe simultaneous partitioning. In section 3, we propose a BB–SP–algorithm (branch and bound based simultaneous parti-tioning algorithm), that provides the optimal solution, and an MOGA–SP–algorithm (multi-objective GA based simultaneous partitioning algorithm), that provides near optimal solu-tions. In section 4, we evaluate the results of the proposed methods.

2 Program Partitioning and

Data Distribution

Parallelizing compilers possess a UIR as the information of program flows and data depen-dence. To partition and distribute programs and data, some information of the UIR is trans-formed to a task graph.

In this paper, we use TVRG [4], which has been implemented in PROMIS-NWU [10], as the UIR. In TVRG, program flows are trans-formed into an acyclic weighted directional graph. Flow dependence and output depen-dence of data are also obtained as the node number information in the graph. Combining the acyclic weighted directional graph and data dependence, a task graph for our simultaneous partitioning algorithm is created.

In subsection 2.1, we explain the

transforma-Table 1: An example of Arn.

array name

width arrayA arrayB arrayC

m=0 m=1 m=2

start of the width (l=0) 5 3 -1

end of the width (l=1) 10 3 -1

tion definitions from TVRG to the task graph. In subsection 2.2, we describe the definition and objectives for our simultaneous partition-ing algorithm.

2.1 Definitions for Task Graphs

Program flows and data dependence are repre-sented as an acyclic weighted directional task graph G = (N, E), where N and E indicate the sets of all nodes and edges of the task graph, respectively. Each n ∈ N corresponds to a task (a basic block) of a given program, and is assigned to a positive number (starting from 1). A function t(n) gives the cost required to execute the node n. A two dimensional array Arn(l, m) expresses the range of an array

ele-ments accessed in the node n, where l shows the start (l = 0) and end (l = 1) element num-ber and m indicates the array (the mth _array

accessed in the node). Therefore, the array Ar_n has (the number of arrays)∗2 elements.

The table 1 shows an example of the ar-ray access range Arn consisting of 3 ∗ 2

ele-ments. In the table 1, the node n accesses the element range from the 5th _{to the 10}th _of

ar-rayA. The node n also uses the 3th _element

of arrrayB, and does not use arrayC. An edge e = (ni, nj) ∈ E (ni < nj) indicates the

exis-tence of data dependency from node nito node

nj, and the function c(e) gives the

communi-cation cost for the edge e, when node ni and

nj are assigned to separate processors.

2.2 Definitions for Simultaneous

Partitioning

In our simultaneous partitioning algorithm, the number of possible edge conditions is three: 1) Inter-PE (Inter Partitioning Edge) where the nodes connected to the edge are assigned to different processors, 2) Intra-PE (Intra Parti-tioning Edge) where the nodes connected to the edge are assigned to the same processor, and 3) U-E (Unexamined Edge) where the edge has not been examined whether it is Inter-PE or Intra-PE.

When a set of node Ni = {n1, n2, ...} whose

edges are Intra-PE is assigned to the same pro-cessor, their communication costs are

(3)

consid-Inter-PE

Intra-PE

ni

nj

Figure 1: Example of deadlock

ered as zero. When ni ∈ Ni are merged, the

total communication cost is given as t(Ni) =

P

ni∈Nit(ni). The array access range is

cal-culated as ArNi(0, m) = minni∈NiArni(0, m)

(except Arni(0, m) = −1), and ArNi(1, m) =

max_n_i_∈N_iAr_n_i(1, m). In the case that a set of Inter-PE edges Ei = {ei1, ei2, ...} of Ni

are connected to a node of Ni, the Ei is said

to be merged, and the cost is calculated as c(Ei) =Pei∈Eic(ei).

Let P = hn₁, n₂, ..., n_mi be a path composed of Inter-PEs and their sequentially connecting nodes (i.e. there is an edge between niand ni+1

where 1 ≤ i < m). The cost T_τ of P is given as Tτ(P ) =Pni∈Pt(ni)+

P

ei∈alledgeswithinPc(ei).

The critical path is defined as the path P so that Tτ(P ) is the largest among the whole

paths in the task graph.

The memory consumption of each node is calculated as mem(n) = P_j(Arn(1, j) −

Arn(0, j) + 1). Thus, the total memory

con-sumption in a task graph G = (N, E) is given as M em(G) = P_n∈Nmem(n). The maxi-mum difference between memory consumption at each node of G is defined as Ran(G) = (max_n∈Nmem(n)) − (min_n∈Nmem(n)).

The optimal solution must satisfy the fol-lowing objectives: 1) Shorten the critical path length (the number of nodes in the path) to reduce the parallel program execu-tion time, 2) Minimize the memory consump-tion M em(G), and 3) Minimize the maximum difference Ran(G). Note that parallelization experts always take the above objectives into consideration to develop effective parallel pro-grams for distributed memory computers.

Figure 1 shows the occurrence of deadlock. Obviously, the graph is against the condition of an ’acyclic graph’. Therefore, solutions with deadlock must be removed from search space.

3 Simultaneous

Partitioning

Algorithm

Program partitioning and data distribution are considered as a graph partitioning problem. Based on the graph partitioning that

deter-mines whether a U-E is an Inter-PE or an Intra-PE, the node assignment to each proces-sor is obtained. In conclusion, the graph parti-tioning is a problem to find a state of edges, which minimizes the critical path, M em(G) and Ran(G), among 2ε(G) _{candidates, where}

ε(G) indicates the number of edges. The opti-mal partitioning is obtained by a branch and bound method, because the search space is ex-pressed as a binary search tree. However, the BB–SP–algorithm may encounter a combinato-rial explosion of computation. Consequently, the BB–SP–algorithm can not give the opti-mal partitioning of task graphs with a lot of edges, even if the execution environment pro-vides high–speed computing and huge memory. GA based graph partitioning algorithms give near optimal solutions of the task having a lot of edges with practical computational re-sources [9] [7]. In the GA based graph parti-tioning, the genes correspond to the edges of a given task graph. The results provided by the GA methods are sensitive to the coding mechanism of the genes. Therefore, to obtain more effective partitioning, we have proposed an edge sorting rule [7].

In subsection 3.1 and 3.2, we propose a BB– SP–algorithm and an MOGA–SP–algorithm. In subsection 3.3, we explain the sorting rule and two ordering methods for the effective node numbering.

3.1 Branch and Bound based

Simul-taneous Partitioning

In simultaneous partitioning algorithms, some optimal solutions may not satisfy multi objec-tives. Therefore, we apply a BB–SP–algorithm to obtain three optimal solutions which satisfy each objectives. Note that just the BB–SP– algorithm is not applicable when the task graph is too huge [5] [6].

At the initial setting of the BB–SP– algorithm, it has not been decided whether each node should be merged or not. Then, branch operations are performed repeatedly to generate the list of partial configurations. A partial configuration is expressed as a tuple < S, O >, where S contains the state informa-tion of all edges as an array, and O indicates the values of each objective as an array.

The procedure of the BB–SP–algorithm is summarized as follows.

1. Generate a task graph G from TVRG.

2. Generate the initial configuration < S, O >, where each element of S is the U-E state, and the critical path obtained from O has the largest execution time.

(4)

3. Select a partial configuration < S0_{, O}0 _{> so that the}

minimum critical path obtained from O0_{is included.}

4. Examine whether each element of S0 _{is not the U-E}

state.

• If no U-E edge remain, G0 _{is the optimal}

parti-tioning from the viewpoint of the critical path. Go to step 8.

• If there are U-E edges, choose an edge e0_whose

state is U-E.

5. Generate a partial configuration < S01_{, O}01 _{> by}

transforming the edge e0_{state to Inter-PE.}

Create a graph G0_{with S}01_{from G.}

Calculate O01_{of G}0_.

Add < S01_{, O}01_{> to the partial configuration list.}

6. Generate a partial configuration < S02_{, O}02 _{> by}

transforming the edge e0_{state to Intra-PE.}

Create a graph G0_{with S}02_{from G.}

Examine whether G0_{has deadlock.}

• If G0 _{does not have deadlock, Calculate O}02 _of

G0_.

Add < S02_{, O}02 _{> to the partial configuration}

list. 7. Go to step 3.

minimum M em(G0_{) from O}0_{is included.}

state.

par-titioning from the viewpoint of the minimum

Mem(G0_).

Go to step 12.

state is U-E. 10. Execute the step 5 and 6. 11. Go to step 8.

minimum Ran(G0_{) from O}0_{is included.}

state.

par-titioning from the viewpoint of the minimum

Ran(G0_).

The algorithm is terminated.

state is U-E. 14. Execute the step 5 and 6. 15. Go to step 12.

3.2 Multi–Objective Genetic

Algo-rithm based Simultaneous Parti-tioning

In this subsection, we propose an application of an MOGA–SP–algorithm. The problem en-vironments are the same to the previous sub-section. Here we explain the MOGA–SP– algorithm related characteristics.

Chromosome: Each gene corresponds to an edge of a task graph. Hence, the length of the individual is ε(G). Since the edge condition is Inter-PE or Intra-PE, the individual {0, 1} corresponds to {Intra-PE, Inter-PE }.

Crossover: When multiple crossover points are employed, the ratio of generating the dead-locked offspring may increase. Therefore, we adopt a single crossover point with the rate of 0.75. Two individuals are selected from the ancestor at random. The crossover point is se-lected at random. 100(%) 80 60 40 20 0 0 1 2 3 4 5 6 7 8 9 10 (generation)

Sorting0O Ordering0A Ordering1A Figure 2: The ratio of the individuals with deadlock at each generation

Mutation: The mutation rate is 0.01. For the mutation operation, 5% of the individuals are selected and 20% of genes are flipped. Fitness value: Each individual has three kinds of fitness values.

One of the fitness values is the length of the critical path of given task graphs expressed by the chromosome 01. In the case that the given task graph has deadlock, which means the crit-ical path length is infinite, we regard the fit-ness value of the individual as P_n∈Gt(n) +

P

e∈Gc(e) + 1.

The other fitness values are M em(G) and Ran(G).

Since the fitness values are defined as the length of the critical path, M em(G) and Ran(G), an individual with smaller fitness val-ues is better.

Process: As the initial setting, X = ε(G)2

individuals are generated, and their genes are set 0 or 1 at random. At the update stage, the elitism strategy is adopted [3].

The procedure of an MOGA–SP–algorithm is summarized as follows.

1. Select the superior X/6 individuals about the criti-cal path length, M em(G) and Ran(G) from ancestors, and copy them into a set Y .

2. Generate X ∗ 3/2 individuals by a crossover operation, and move them to Y .

3. Perform a mutation operation to Y which consists of 2X individuals.

4. Calculate the fitness values of all the individuals of Y . 5. Select the superior X/3 individuals about the critical path length, M em(G) and Ran(G) from Y as the off-spring.

3.3 Sorting and Ordering

To make effective use of a branch and bound based partitioning algorithm, the list of edges (e = (ni, nj)) is sorted by t(ni) + t(nj) + c(e)

by descendent order [2].

On the other hand, when creating children by crossover operations, it is desired to keep them inherit the effective properties of their parents. Hence, we adopt Sorting 0O method proposed in [7].

(5)

Mem(G) CriticalPath Mem(G) CriticalPath Ran(G) CriticalPath 0 generation Ran(G) Mem(G) (a) (b) (d) Ran(G) 4300 3600 2900 2200 1500 470 380 290 200 4700₈₉₀₀ 13100₁₇₃₀₀ 21500 470 380 290 200 470 380 290 200 47008900131001730021500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 1500 2200 2900 3600 4300 1500 2200 2900 3600 4300 4300 3600 2900 2200 1500 47008900131001730021500 47008900131001730021500 47008900131001730021500 47008900131001730021500 47008900131001730021500 1500 2200 2900 3600 4300 (c) 4700₈₉₀₀ 13100₁₇₃₀₀ 21500 4700₈₉₀₀ 13100₁₇₃₀₀ 21500 10 generation 5 generation

Figure 3: The fitness values for each chromosome without deadlock. (Using Sorting 0O.)

Sorting 0O

Edges e = (ni, nj) are sorted in the ascending order by ni,

and by njwhen the node nihas multiple incoming edges.

Since Sorting 0O depends heavily on the or-der of the nodes, we also use Oror-dering 0A or Ordering 1A [7].

Ordering 0A

1. Let max be the largest node number among all nodes. 2. • When max ≤ 2, the algorithm is terminated.

• When max > 2, large = max − 1.

3. Make a list In-E including ei= (ni, nmax): 1 ≤ i ≤

max − 1

4. Check whether incoming edges to the node nmaxexist.

• When In-E is empty,

reorder all nodes that are not the source of in-coming edges to the node nmaxstarting from 1.

Subtract 1 from max. Go to step 2.

• When In-E is not empty,

select and remove the ei, where niis the

small-est node number, from In-E. 5. Change nito nlarge.

Subtract 1 from large. Go to step 4.

Ordering 1A

1. Let max be the largest node number among all nodes. Let min be 1.

2. • When max = min, the algorithm is terminated. • When max 6= min, small = min + 1.

3. Make a list Out-E including ei = (nmin, ni): min ≤

i ≤ max.

4. Check whether outgoing edges to the node nminexist.

• When Out-E is empty,

Reorder all nodes that are not the source of

in-coming edges to the node nmin starting from

small.

Add 1 to min. Go to step 2.

• When Out-E is not empty,

select and remove ei, where ni is the smallest

node number, from Out-E.

5. Change nito nsmall.

Add 1 to small. Go to step 4.

Ordering 0A and Ordering 1A are an effec-tive method for the reference of incoming and outgoing edges, respectively.

4 Experiments

We performed experiments for the evaluation of MOGA–SP–algorithm.

Experimental environment is: Linux 2.4.7 (RedHat Linux 7.1.2), Pentium 4 (1.8GHz) with memory capacity of 1GB.

For the experiment in MOGA–SP– algorithm, the generation count is 10. An acyclic weighted directional task graph with 30 nodes and 29 edges are generated for the experiments, because BB–SP–algorithm can not be executed in the case of larger task graphs. Each edge is connected randomly so that the number of outgoing edges per node does not exceed five. Each node cost and edge cost are set to random numbers with a uniform distribution of [501, 1000]. The number of arrays used in the task graph is five, and each array has 100 elements.

As the result, the execution times for BB– SP–algorithm and MOGA–SP-algorithm are 6.8[sec] and 0.9[sec], respectively.

(6)

0 generation (a) (b) 4300 3600 2900 2200 1500 470 380 290 200 470 380 290 200 470 380 290 200 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 4300 3600 2900 2200 1500 4700₈₉₀₀ 13100 17300₂₁₅₀₀ 4700₈₉₀₀ 13100 17300₂₁₅₀₀ 4700₈₉₀₀ 13100₁₇₃₀₀ 21500 47008900131001730021500 47008900131001730021500 47008900131001730021500 47008900131001730021500 47008900131001730021500 47008900131001730021500 1500 2200 2900 3600 4300 1500 2200 2900 3600 4300 1500 2200 2900 3600 4300 (c) (d) 5 generation 10 generation Mem(G) CriticalPath Mem(G) CriticalPath Ran(G) CriticalPath Ran(G) Mem(G) Ran(G)

Figure 4: The fitness values for each chromosome without deadlock. (Using Ordering 0A.) In BB–SP–algorithm experiments, the

ob-jective values for the optimal solution for the critical path length are max Tτ = 4683,

M em(G) = 3873, and Ran(G) = 244. The optimal objective values for minimum Ran(G) are max Tτ = 7469, M em(G) = 3967, and

Ran(G) = 207. The optimal objective values for minimum M em(G) are max Tτ = 14835,

M em(G) = 1953, and Ran(G) = 440.

To reduce M em(G), the accessed element size of arrays shared by different processors must be small. In a typical task graph, the range of the array elements used in a node laps to the other nodes. To decrease the over-lapped area, those overlapping nodes should be merged in the same node. Consequently, the optimal solution for minimum M em(G) makes max Tτ increase because most of nodes

have been merged. Hence, simultaneous par-titioning algorithms must optimize the critical path length and the Ran(G) size, while the M em(G) size should be compromised by mem-ory capacity of each processor.

In MOGA–SP–algorithm experiments, we observe that individuals with deadlock exist at each generation. Figure 2 shows the ratio of in-dividuals with deadlock. In the figure, the hor-izontal axis means the number of generations, and vertical line corresponds the deadlock ra-tio (%). With all edge sorting methods, the

number of individuals with deadlock decreases according to genetic operations. Consequently, the superior individuals with shorter critical paths increase by offspring. Note that we could not observe that individuals with deadlock are eliminated by genetic operation.

Figure 3, 4, and 5 show distributions of in-dividuals without deadlock at each generation. In (a) of each figure, the horizontal axis, the vertical axis, and the depth axis represent the critical path length, the Ran(G) size, and the M em(G) size, respectively. In (b) and (c), the horizontal axis and the vertical axis rep-resents the critical path length and the size of M em(G) and Ran(G), respectively. In (d), the horizontal axis and the vertical axis are M em(G) and Ran(G), respectively.

In (c), the fitness values of each individual at the 10th_{generation are closer to the optimal}

than at the initial generation. In (d), M em(G) increases and Ran(G) decreases at the 10th

generation. Compared with the experimental result of BB–SP–algorithm, the optimal solu-tion for critical path is also an approximate solution for Ran(G), and the optimal solution for Ran(G) has smaller critical paths than for M em(G). Therefore, at the 10th generation, several individuals are similar to the optimal solutions by BB–SP–algorithm. Consequently, we find our MOGA–SP–algorithm is effective.

(7)

0 generation (a) (b) (d) (c) 4300 3600 2900 2200 1500 470 380 290 200 470 380 290 200 470 380 290 200 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 4300 3600 2900 2200 1500 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 470 380 290 200 4300 3600 2900 2200 1500 4700 8900₁₃₁₀₀ 17300₂₁₅₀₀ 4700 8900₁₃₁₀₀ 17300₂₁₅₀₀ 4700 8900₁₃₁₀₀ 17300₂₁₅₀₀ 4700 8900 13100 17300 21500 1500 2200 2900 3600 4300 1500 2200 2900 3600 4300 4700 8900 13100 17300 21500 4700 8900 13100 17300 21500 4700 8900 13100 17300 21500 4700 8900 13100 17300 21500 4700 8900 13100 17300 21500 1500 2200 2900 3600 4300 5 generation 10 generation Mem(G) CriticalPath Mem(G) CriticalPath Ran(G) CriticalPath Ran(G) Mem(G) Ran(G)

Figure 5: The fitness values for each chromosome without deadlock. (Using Ordering 1A.) In Figure 3 (b), we observe that the initial

individuals have similar characteristics only when applying the Sorting 0O method. In gen-eral, the initial individuals in GA should start from difference conditions [3]. The more gen-erations, the smaller difference the variance by each ordering method is. At the 10th

gener-ation by Ordering 0A and Ordering 1A, the critical path length and the Ran(G) size de-crease, and M em(G) size increases. By using Ordering 0A or Ordering 1A, a lot of approx-imate solutions are observed. Thus, Ordering 0A and Ordering 1A methods are both effec-tive for MOGA–SP–algorithm.

5 Conclusions

In this paper, we proposed BB–SP–algorithm and MOGA–SP–algorithm for simultaneous partitioning of program partitioning and data distribution. When the BB–SP–algorithm is executed with a typical computer, we obtained the optimal solution in the case of task graphs with 30 nodes and 29 edges. Using the MOGA– SP–algorithm, a larger task graph can be par-titioned approximately. We also validated Or-dering 0A and OrOr-dering 1A as effective order-ing methods for the MOGA–SP–algorithm.

For the future works, we should investigate other possible crossover, mutation, and fitness value strategies for more effective solutions.

References

[1] M. R. Garey, D.S. Johnson. Computers and

Intractabil-ity, A Guide to the Theory of NP-Completeness. W.

H. Freeman and Company, San Francisco, California, 1979.

[2] M. Girkar, C. D. Polychronopoulos. Partitioning Pro-grams for Parallel Execution. Proc. of the 1988 Int.

Conf. on Supercomputing, pp.216–229, 1988.

[3] D. E. Goldberg. Genetic Algorithms in Search,

Op-timization, and Machine Learning. Addison-Wesley,

1989.

[4] M. Haneda, H. Shouno, K. Joe. An Intermediate Rep-resentation for Parallelizing Compilers for DSM Sys-tems. The International Workshop on Advanced

Com-piler Technology for High Performance and Embedded Systems, pp.47–55, July, 2001.

[5] M. Takata, Y. Kunieda, K. Joe. Accelerated Program Partitioning Algorithm - An Improvement of Girkar’s Algorithm-. Proc. of the Int. Conf. on Parallel and

Distributed Processing Techniques and Applications,

Vol.II, pp.699–705, June, 2000.

[6] M. Takata, Y. Kunieda, K. Joe. A Heuristic Approach to Improve a Branch and Bound based Program Parti-tioning Algorithm. The proceedings of the 1999

Inter-national Workshop on Innovative Architecture, pp.105–

114, November, 2000.

[7] M. Takata, H. Shouno, K. Joe: An Improvement of

Program Partitioning Based Genetic Algorithm. The

proceedings of the International Conference on Paral-lel and Distributed Processing Techniques and Applica-tions (PDPTA ’02), Vol.I, pp.215–221, June, 2002.

[8] H. Saito, N. Stavrakos, C. D. Polychronopoulos, A. Nicolau. The Design of the PROMIS Compiler. Int’l

J. of Parallel Programming, Vol.28, No.2, pp.195–212,

2000.

[9] T. Saito, T. Nakanishi, Y. Kunieda, A. Fukuda. Ge-netic Algorithm Based Program Partitioning. Proc. of

the Int. Conf. on Parallel and Distributed Processing Techniques and Applications, Vol.II, pp.707–712, June,

2000.

[10] T. Yamaguchi, H. Ishiuchi, A. Iwasaka, M. Haneda, H. Shouno, K. Joe. The Overview of the Paralleliz-ing Compiler PROMIS-NWU. Proc. of IPSJ SIGARC