
Nested Loop Transformation for Full Parallelism

Nelson Luiz Passos and Edwin Hsing-Mean Sha

Research Report CSE-TR-94-011

April 1994

Abstract

Transformation techniques are usually applied to obtain optimal execution rates in parallel and/or pipelined systems. The retiming technique is a common and valuable tool for one-dimensional problems. Most scientific or DSP applications are recursive or iterative, and uniform nested loops can be modeled as multi-dimensional data flow graphs (DFGs). Achieving full parallelism of the loop body, i.e., executing all the computational nodes in parallel, substantially decreases the overall computation time. It is well known that for one-dimensional DFGs retiming cannot always achieve full parallelism. This paper shows an important and counter-intuitive result: full parallelism can always be obtained for DFGs with more than one dimension. It also presents two novel multi-dimensional retiming techniques to obtain full parallelism. Examples, descriptions, and the correctness of our algorithms are presented in the paper.


1 Introduction

Applications such as image processing, fluid mechanics, and weather forecasting require high computer performance. Researchers and designers in those areas are looking for solutions to multi-dimensional problems through the use of parallel computers and/or specialized hardware, which justifies the study of multi-dimensional optimization algorithms. Efficient transformation mechanisms for parallel and/or pipelined processing of one-dimensional problems represented by data flow graphs have been proposed in previous studies using retiming techniques [16]. The retiming technique regroups operations in iterations in order to produce a new iteration structure with higher embedded parallelism [4, 21]. An equivalent approach to the multi-dimensional case will benefit parallel compiler design for VLIW, data-flow, and superscalar architectures [1, 10, 14], as well as high-level synthesis for VLSI design [3, 18].

Computation-intensive applications usually depend on time-critical sections consisting of a loop of instructions. To optimize the execution rate of such applications, the designer needs to explore the parallelism embedded in the repetitive patterns of a loop. Some research has been done on uniform nested loop scheduling, which takes a similar view of the multi-dimensional problem; examples are unimodular transformations [25], loop skewing [24], and loop quantization [2]. These techniques differ from our method because they do not change the structure of the iterations, only the sequence in which the instructions are executed.

Other techniques that search beyond loop boundaries include perfect pipelining [1] and Doacross loops [7]. Perfect pipelining looks for a repeating pattern in a single loop, with the disadvantage that both the size of such a pattern and the performance gain are unpredictable. A Doacross loop overlaps consecutive iterations, which improves the execution rate. These two methods can be regarded as special cases of our technique.

In our study, loop bodies are represented by multi-dimensional data flow graphs (MDFGs). Instead of simply overlapping iterations, we restructure the loop body, i.e., the existing dependence delays, while preserving the original data dependencies. We model this restructuring as a multi-dimensional retiming: the new MDFG is constructed from the original graph through the retiming function. The retiming process and loop pipelining are, in essence, the same concept [5]. Previous research in loop pipelining for loops with cyclic dependencies produced methods that focus on one-dimensional problems. Such methods appear in several systems [11, 15, 20, 23].

In the one-dimensional retiming method, negative delays represent the existence of a cycle in the execution sequence and cannot be allowed. In the multi-dimensional case, however, negative delays are natural as long as there exists a schedule $s$ that makes the graph realizable. Chao and Sha introduced the concept of a restricted multi-dimensional retiming [4]. Their algorithm cannot guarantee full parallelism and is only applicable to a specific class of MDFGs, in which any cycle has a strictly non-negative total multi-dimensional delay. In this paper, two new algorithms are presented: incremental multi-dimensional retiming and chained multi-dimensional retiming. The first uses a legal retiming to successively restructure the loop body represented by a general form of MDFG. The second improves the efficiency of the first by obtaining the final solution in only one pass. Both techniques achieve full parallelism.

(a) [MDFG drawing not reproduced: multiplier nodes M1 to M8, adder nodes A1 to A7, edge delays (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]

(b)

      DO 10 n1=1,N1
      DO 20 n2=1,N2
      y(n1,n2) = x(n1,n2) + c(0,1)*y(n1,n2-1) +
     -   c(0,2)*y(n1,n2-2) + c(1,0)*y(n1-1,n2) +
     -   c(1,1)*y(n1-1,n2-1) + c(1,2)*y(n1-1,n2-2) +
     -   c(2,0)*y(n1-2,n2) + c(2,1)*y(n1-2,n2-1) +
     -   c(2,2)*y(n1-2,n2-2)
   20 CONTINUE
   10 CONTINUE

Figure 1: (a) MDFG representing an IIR filter (b) equivalent Fortran code

Full parallelism is the simultaneous execution of all operations (nodes) in an MDFG. Therefore, obtaining full parallelism is equivalent to obtaining a non-zero delay on every edge of the MDFG. In the one-dimensional case, such a condition cannot always be achieved through retiming, because the sum of the delays in a cycle is constant.

For simplicity, we use two-dimensional problems without instructions between loops as examples. The two dimensions are generically referred to as $x$ and $y$. The multi-dimensional case is a straightforward extension of the concepts presented. The main example consists of an IIR (infinite-extent impulse response) filter [8], represented by the transfer function

$$H(z_1, z_2) = \frac{1}{1 - \sum_{n_1=0}^{2} \sum_{n_2=0}^{2} c(n_1, n_2)\, z_1^{-n_1} z_2^{-n_2}}$$

which can be translated into

$$y(n_1, n_2) = x(n_1, n_2) + \sum_{k_1=0}^{2} \sum_{k_2=0}^{2} c(k_1, k_2)\, y(n_1 - k_1, n_2 - k_2) \quad \text{for } (k_1, k_2) \neq (0, 0).$$
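As a minimal sketch of the difference equation above, the Python function below evaluates it with a plain loop nest, mirroring the Fortran code of figure 1(b). The function name `iir_2d`, the zero-based indexing, and the treatment of out-of-range samples as zero are assumptions made for illustration; the paper does not state boundary conditions.

```python
def iir_2d(x, c):
    """Evaluate y(n1,n2) = x(n1,n2) + sum c(k1,k2) * y(n1-k1, n2-k2), (k1,k2) != (0,0).

    x is a list of rows (N1 x N2); c is a 3 x 3 coefficient table indexed c[k1][k2].
    Assumption: samples of y outside the index range are taken to be zero.
    """
    n1_size, n2_size = len(x), len(x[0])
    y = [[0.0] * n2_size for _ in range(n1_size)]
    for n1 in range(n1_size):          # outer loop (DO 10 in figure 1(b))
        for n2 in range(n2_size):      # inner loop (DO 20 in figure 1(b))
            acc = x[n1][n2]
            for k1 in range(3):
                for k2 in range(3):
                    if (k1, k2) == (0, 0):
                        continue       # y(n1,n2) does not depend on itself
                    if n1 - k1 >= 0 and n2 - k2 >= 0:
                        acc += c[k1][k2] * y[n1 - k1][n2 - k2]
            y[n1][n2] = acc
    return y
```

Every term `c[k1][k2] * y[n1 - k1][n2 - k2]` corresponds to one multiplier node and one edge with delay `(k1, k2)` in the MDFG of figure 1(a).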

The MDFG derived from the equation above is shown in figure 1(a). An equivalent Fortran code is presented in figure 1(b). The retimed graph is presented in figure 2(a), where the two-dimensional delay $(-1,1)$ is pushed through all nodes labeled $M$ in the original MDFG. Intuitively, the retimed nodes no longer precede the additions $A1$ to $A4$ within the same iteration, which allows those operations to be executed simultaneously. This new characteristic reduces the critical path of the graph, and consequently the overall execution time. Figure 2(b) shows the result of a further retiming operation, where the two-dimensional delay $(-3,2)$ was pushed through nodes $A1$ to $A4$, improving the parallelism and further reducing the critical path. Full parallelism is achieved when all edges have non-zero delays.

[MDFG drawings not reproduced.]

Figure 2: (a) MDFG after retiming all nodes M (b) MDFG after retiming A1 to A4

In section 2, we present some basic concepts, such as mathematical models for dependence graphs and data flow graphs, together with the basic ideas of multi-dimensional retiming and cycle period. These concepts are used in the following section to develop the incremental retiming algorithm. Section 3 shows the properties necessary for a legal retiming function. The incremental multi-dimensional retiming algorithm is then described, giving us the foundations required for the main theorem, which proves that full parallelism can always be achieved. Section 4 presents a more efficient algorithm that uses the concept of a chain reaction. Section 5 shows some examples of the application of our techniques, followed by a final conclusion that summarizes the concepts introduced in this paper.

2 Basic Principles

2.1 Background

In this section we present some concepts related to the interpretation of multi-dimensional data flow graphs, such as mathematical models, dependence vectors, iterations, and vector operations.

A multi-dimensional data flow graph (MDFG) $G = (V, E, d, t)$ is a node-weighted and edge-weighted directed graph, where $V$ is the set of computation nodes, $E \subseteq V \times V$ is the set of dependence edges, $d$ is a function from $E$ to $\mathbb{Z}^n$ representing the multi-dimensional delay between two nodes, where $n$ is the number of dimensions, and $t$ is a function from $V$ to the positive integers, representing the computation time of each node. A two-dimensional data flow graph (2DFG) $G = (V, E, d, t)$ is an MDFG where $d$ is a function from $E$ to $\mathbb{Z}^2$. We use $d(e) = (d.x, d.y)$ as a general formulation of any delay shown in a two-dimensional DFG.

(a) [MDFG drawing not reproduced: nodes A, B, C, D; delay (-1,1) on edge B -> D and (1,1) on edge C -> D; all other edges have zero delay]

(b)

      DO 10 j = 0,n
      DO 11 k = 0, m
D:    d(k,j) = b(k+1,j-1) * c(k-1,j-1)
A:    a(k,j) = d(k,j) * .5
B:    b(k,j) = a(k,j) + 1.
C:    c(k,j) = a(k,j) + 2.
   11 CONTINUE
   10 CONTINUE

Figure 3: (a) MDFG extracted from a Wave Digital Filter (b) equivalent Fortran code

An example of a two-dimensional DFG and its equivalent Fortran code is shown in figure 3. For this example, $V = \{A, B, C, D\}$ and $E = \{e_1 : (A, B),\ e_2 : (A, C),\ e_3 : (D, A),\ e_4 : (B, D),\ e_5 : (C, D)\}$, where $d(e_1) = d(e_2) = d(e_3) = (0,0)$, $d(e_4) = (-1,1)$, and $d(e_5) = (1,1)$. For simplicity, each operation is assumed to be executed in one time unit; therefore, $t(A) = t(B) = t(C) = t(D) = 1$.

An iteration is the execution of the loop body exactly once, i.e., the execution of each node in $V$ exactly once. Iterations are identified by a vector $i$, equivalent to a multi-dimensional index, starting from $(0, 0, \ldots, 0)$. Inter-iteration dependencies are represented by vector-weighted edges. For any iteration $j$, an edge $e$ from $u$ to $v$ with delay vector $d(e)$ means that the computation of node $v$ at iteration $j$ depends on the execution of node $u$ at iteration $j - d(e)$. An edge with delay $(0, 0, \ldots, 0)$ represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e., the summation of the delay vectors along any cycle cannot be $(0, 0, \ldots, 0)$. Several techniques are available to verify that an MDFG does not have such a cycle [6, 12].
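The example graph of figure 3(a) and the zero-delay-cycle condition can be sketched in Python as follows. The dictionary layout (`EDGES` keyed by node pairs, which assumes at most one edge per ordered pair), the helper names, and the simple depth-first cycle enumeration are illustrative assumptions; they are not the verification techniques of [6, 12].

```python
# MDFG of figure 3(a): each edge (u, v) maps to its delay vector d(e).
EDGES = {
    ("A", "B"): (0, 0),
    ("A", "C"): (0, 0),
    ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1),
    ("C", "D"): (1, 1),
}


def source_iteration(j, delay):
    """Iteration of u that node v at iteration j depends on: j - d(e)."""
    return tuple(a - b for a, b in zip(j, delay))


def cycle_delays(edges):
    """Return (node list, total delay) for every simple cycle of the graph."""
    dim = len(next(iter(edges.values())))
    succ = {}
    for (u, v), d in edges.items():
        succ.setdefault(u, []).append((v, d))
    nodes = sorted({n for edge in edges for n in edge})
    found = []

    def walk(start, node, path, total):
        for nxt, d in succ.get(node, []):
            new_total = tuple(a + b for a, b in zip(total, d))
            if nxt == start:
                found.append((path, new_total))
            elif nxt > start and nxt not in path:
                # Only visit nodes larger than the start, so that each
                # cycle is reported once, from its smallest node.
                walk(start, nxt, path + [nxt], new_total)

    for start in nodes:
        walk(start, start, [start], (0,) * dim)
    return found


def has_zero_delay_cycle(edges):
    """True if some cycle has total delay (0, ..., 0), i.e. the MDFG is not legal."""
    return any(all(c == 0 for c in total) for _, total in cycle_delays(edges))
```

For the graph of figure 3(a), the two cycles through `D` have total delays `(-1, 1)` and `(1, 1)`, so the MDFG is legal.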

An equivalent cell dependence graph of an MDFG $G$ is the directed acyclic graph showing the dependencies between copies of nodes representing the MDFG. Figure 4(a) shows the replication of the MDFG in figure 3(a), and figure 4(b) shows the cell dependence graph with each node representing a copy of the MDFG. A computational cell is the cell dependence graph (DG) node that represents a copy of an MDFG, excluding the edges with delay vectors different from $(0, 0, \ldots, 0)$, i.e., a complete iteration. A cell is considered an atomic execution unit.

A mathematical model for a cell dependence graph expanded from an MDFG is presented in [17]: a tuple $(I^n, D)$, where $I^n$ is the index set of the cell structure, equivalent to the vectors used to identify each iteration, and $D$ is a matrix containing all dependence vectors (delay vectors). Consider, for example, the DG shown in figure 4; its model is described as

$$I^2 = \{(i, j) : 0 \le i \le 3,\ 0 \le j \le 3\}, \qquad D = \begin{bmatrix} -1 & 1 \\ 1 & 1 \end{bmatrix}$$

[Dependence graph drawings not reproduced: cells indexed from (0,0) to (3,3).]

Figure 4: (a) DG based on the replication of an MDFG, showing iterations starting at (0,0). (b) DG represented by computational cells

Uniform loops are those that present the characteristic of constant dependence vectors, i.e., data dependencies are at a constant distance in the iteration space.

The notation $u \xrightarrow{e} v$ means that $e$ is an edge from node $u$ to node $v$. The notation $u \overset{p}{\leadsto} v$ means that $p$ is a path from $u$ to $v$. The delay vector of a path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} v_2 \cdots \xrightarrow{e_{k-1}} v_k$ is $d(p) = \sum_{i=0}^{k-1} d(e_i)$, and the total computation time of a path $p$ is $\sum_{i=0}^{k} t(v_i)$.

To manipulate MDFG characteristics represented in vector notation, such as the delay vectors, we make use of component-wise vector operations. Considering two two-dimensional vectors $P$ and $Q$, represented by their coordinates $(P.x, P.y)$ and $(Q.x, Q.y)$, an example of an arithmetic operation is $P + Q = (P.x + Q.x,\ P.y + Q.y)$. The notation $P \cdot Q$ indicates the inner product between $P$ and $Q$, i.e., $P \cdot Q = P.x\,Q.x + P.y\,Q.y$.

A schedule vector $s$ is the normal vector for a set of parallel equitemporal hyperplanes that define a sequence of execution of a cell dependence graph. The existence of a schedule vector prevents the existence of a cycle. We say that an MDFG $G = (V, E, d, t)$ is realizable if there exists a schedule vector $s$ for the cell dependence graph with respect to $G$, i.e., $s \cdot d \ge 0$ for any $d \in G$ [13].

(a) [MDFG drawing not reproduced: delay (1,0) on edge D -> A, (-2,1) on edge B -> D, and (0,1) on edge C -> D; edges A -> B and A -> C have zero delay]

(b)

      DO 10 j = 0, n
C     prologue
      d(0,j) = b(1,j-1) * c(-1,j-1)
      DO 11 k = 0, m-1
      d(k+1,j) = b(k+2,j-1) * c(k,j-1)
      a(k,j) = d(k,j) * .5
      b(k,j) = a(k,j) + 1.
      c(k,j) = a(k,j) + 2.
   11 CONTINUE
C     epilogue
      a(m,j) = d(m,j) * .5
      b(m,j) = a(m,j) + 1.
      c(m,j) = a(m,j) + 2.
   10 CONTINUE

Figure 5: (a) MDFG after retimed by r(D)=(1,0) (b) equivalent Fortran code

2.2 Retiming a Multi-Dimensional Data Flow Graph

In this subsection, we show the basic ideas on multi-dimensional retiming, including cycle period, critical path, and characteristics of legal retiming.

The period during which all computation nodes in an iteration are executed, according to existing data dependencies and without resource constraints, is called a cycle period. The cycle period $C(G)$ of an MDFG $G = (V, E, d, t)$ is the maximum computation time among paths that have no delay. For example, the MDFG in figure 3(a) has $C(G) = 3$, which can be measured through the paths $p = D \to A \to B$ or $p = D \to A \to C$.
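A minimal Python sketch of this definition computes the cycle period as the longest node-weighted path through zero-delay edges only. The dictionary representation of the graph and the function name `cycle_period` are assumptions for illustration, and the recursion relies on the MDFG being legal (no zero-delay cycle).

```python
def cycle_period(edges, t):
    """Maximum computation time among paths whose edges all have zero delay.

    edges maps (u, v) to a delay vector; t maps each node to its computation time.
    Assumes the zero-delay edges form a DAG, as in any legal MDFG.
    """
    zero_succ = {}
    for (u, v), d in edges.items():
        if all(c == 0 for c in d):
            zero_succ.setdefault(u, []).append(v)

    memo = {}

    def longest_from(v):
        # Computation time of the longest zero-delay path starting at v.
        if v not in memo:
            tail = max((longest_from(w) for w in zero_succ.get(v, [])), default=0)
            memo[v] = t[v] + tail
        return memo[v]

    return max(longest_from(v) for v in t)


# MDFG of figure 3(a), with unit computation times.
FIG3 = {
    ("A", "B"): (0, 0),
    ("A", "C"): (0, 0),
    ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1),
    ("C", "D"): (1, 1),
}
UNIT_TIME = {"A": 1, "B": 1, "C": 1, "D": 1}

print(cycle_period(FIG3, UNIT_TIME))  # 3, through D -> A -> B or D -> A -> C
```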

A multi-dimensional retiming $r$ is a function from $V$ to $\mathbb{Z}^n$ that redistributes the nodes in the original dependence graph created by the replication of an MDFG $G$. A new MDFG $G_r$ is created, such that each iteration still has one execution of each node in $G$. The retiming vector $r(u)$ of a node $u \in G$ represents the offset between the original iteration containing $u$ and the one after retiming. The delay vectors change accordingly to preserve dependencies, i.e., $r(u)$ represents delay components pushed into the edges $u \to v$ and subtracted from the edges $w \to u$, where $u, v, w \in G$. Therefore, we have $d_r(e) = d(e) + r(u) - r(v)$ for every edge $u \xrightarrow{e} v$, and $d_r(l) = d(l)$ for every cycle $l \in G$. After retiming, the execution of node $u$ in iteration $i$ is moved to iteration $i - r(u)$. A two-dimensional retiming $r$ is a function from $V$ to $\mathbb{Z}^2$ that redistributes the nodes in the original dependence graph created by the replication of a 2DFG $G$, resulting in a new 2DFG $G_r$ in which each iteration still has one execution of each node in $G$, reducing the cycle period as in the one-dimensional case. For example, figure 5(a) shows the MDFG from figure 3(a) retimed by the function $r$ with $r(D) = (1,0)$. Figure 5(b) shows the modified Fortran code. The critical paths of this graph are the edges $A \to B$ and $A \to C$, with an execution time of 2 time units.
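The retiming rule $d_r(e) = d(e) + r(u) - r(v)$ can be sketched in a few lines of Python. The function names and the convention that nodes missing from `r` are retimed by the zero vector are assumptions for illustration.

```python
def retime(edges, r):
    """Apply d_r(e) = d(e) + r(u) - r(v) to every edge u -> v.

    edges maps (u, v) to a delay vector; r maps nodes to retiming vectors.
    Assumption: nodes that do not appear in r are not retimed.
    """
    zero = (0,) * len(next(iter(edges.values())))
    retimed = {}
    for (u, v), d in edges.items():
        ru, rv = r.get(u, zero), r.get(v, zero)
        retimed[(u, v)] = tuple(x + a - b for x, a, b in zip(d, ru, rv))
    return retimed


def path_delay(edges, nodes):
    """Sum of the delay vectors along the consecutive node pairs of a path."""
    total = [0] * len(next(iter(edges.values())))
    for u, v in zip(nodes, nodes[1:]):
        total = [a + b for a, b in zip(total, edges[(u, v)])]
    return tuple(total)


# MDFG of figure 3(a) and the retiming r(D) = (1, 0) of figure 5(a).
FIG3 = {
    ("A", "B"): (0, 0),
    ("A", "C"): (0, 0),
    ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1),
    ("C", "D"): (1, 1),
}
FIG5 = retime(FIG3, {"D": (1, 0)})
# FIG5: D -> A gets (1, 0), B -> D gets (-2, 1), C -> D gets (0, 1).
```

The total delay of the cycle D -> A -> B -> D is `(-1, 1)` both before and after retiming, which illustrates $d_r(l) = d(l)$.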

The retimed cell DG for the example in figure 3(a) is shown in figures 6(a) and (b), where the nodes originally belonging to iteration $(0,0)$ are marked. A possible schedule vector for the retimed graph is $s = (1,3)$. Figure 7(a) shows an illegal retiming function applied to the same example. By simple inspection of the cell dependence graph in figure 7(b), we notice the existence of a cycle created by the dependencies $(1,0)$ and $(-1,0)$.

[Dependence graph drawings not reproduced: cells indexed from (0,0) to (3,3), with the prologue marked and schedule vector s = (1,3).]

Figure 6: (a) DG based on the replication of an MDFG, after retiming. (b) same DG represented by computational cells.

[Drawings not reproduced. In (a), r(D) = (0,1) and r(A) = r(B) = r(C) = (0,0), with delay (-1,0) on edge B -> D, (1,0) on edge C -> D, and (0,1) on edge D -> A.]

Figure 7: (a) Example of illegal retiming. (b) DG showing cycles in the x-direction.
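The two cases can be checked numerically with the sketch below. It uses the strict test $s \cdot d > 0$ for every non-zero delay, which is the condition the paper later attaches to the strictly positive scheduling subspace. The function names and the search bound are assumptions, and the brute-force search only covers a small integer range, so a `None` result is evidence within that range rather than a proof. For the delays of figure 7(a), however, $(1,0)$ and $(-1,0)$ cannot both have a positive inner product with the same $s$.

```python
from itertools import product


def dot(p, q):
    """Inner product of two vectors given as tuples."""
    return sum(a * b for a, b in zip(p, q))


def is_strict_schedule(s, delays):
    """True if s . d > 0 for every non-zero delay vector d."""
    return all(dot(s, d) > 0 for d in delays if any(d))


def search_schedule(delays, bound=5):
    """Look for a 2-D integer schedule vector with components in [-bound, bound]."""
    for s in product(range(-bound, bound + 1), repeat=2):
        if any(s) and is_strict_schedule(s, delays):
            return s
    return None


# Delays after the legal retiming r(D) = (1, 0) of figure 5(a).
LEGAL = [(0, 0), (0, 0), (1, 0), (-2, 1), (0, 1)]
# Delays after the illegal retiming r(D) = (0, 1) of figure 7(a).
ILLEGAL = [(0, 0), (0, 0), (0, 1), (-1, 0), (1, 0)]

print(is_strict_schedule((1, 3), LEGAL))  # True
print(search_schedule(ILLEGAL))           # None
```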

A prologue is the set of instructions that are moved in the directions $x$ and $y$ by a two-dimensional retiming and that must be executed to provide the necessary data for the iterative process. In our example, shown in figure 6(a), the instruction D becomes the prologue. The epilogue is the other extreme of the DG, where a complementary set of instructions is executed to complete the process. In the example, the reduction of the critical path, as seen in figure 5, results in a saving of approximately one third of the total execution time.


3 Incremental Multi-Dimensional Retiming

3.1 Basic Concepts

The technique of multi-dimensional retiming allows changes in more than one direction [4, 19]. Our method uses this property to achieve full parallelism by incrementally applying the retiming function to multi-dimensional problems. Properties, the algorithm, and supporting theorems for incremental multi-dimensional retiming are presented below. We begin by defining a legal retiming function.

Definition 3.1

Given a realizable MDFG $G = (V, E, d, t)$, a legal retiming for $G$ is a multi-dimensional retiming $r$ that transforms $G$ into $G_r$ such that $G_r$ is still realizable.

Finding a legal retiming of an MDFG such that all its nodes can be executed in parallel is equivalent to obtaining a cycle period equal to one, when we assume the execution time of each node to be one time unit. This problem is more complex than the use of traditional retiming in one-dimensional cases. In those cases, positive delays after the transformation guarantee a legal retiming. For the multi-dimensional problem, however, positive delay vectors are too restrictive, since a dependence graph is still realizable even if it has negative delays. Therefore, the following theorem gives a more general set of constraints for a legal retiming that produces full parallelism among the nodes of an MDFG.

Theorem 3.1

Let $G = (V, E, d, t)$ be an MDFG with $t(v) = 1$ for any $v \in V$. $r$ is a legal multi-dimensional retiming for $G$ such that $C(G_r) = 1$ if and only if:

1. the cell dependence graph of the retimed MDFG $G_r = (V, E, d_r, t)$ does not contain any cycle.

2. $d_r(e) \neq (0, 0, \ldots, 0)$ for any $e \in E$.

3. if the execution time of a path $p$ in $G_r$ is greater than one, then $d_r(p) \neq (0, 0, \ldots, 0)$.

Proof: The first condition results from the realizability of the final cell dependence graph. If a cycle were allowed, we would find cells depending on their own results, waiting to be executed, which is a contradiction. The second and third conditions result from the properties associated with the cycle period. □

Our technique enforces these constraints through the use of an incremental retiming that guarantees the realization of each new retimed graph until all edges become non-zero delay edges.

[Drawing not reproduced: delay vectors d, d' and d'' drawn from the origin (0,0) of the x-y plane, with the scheduling subspace S marked.]

Figure 8: An example of scheduling subspace

3.2 The Scheduling Subspace

We know that the existence of a linear schedule vector for the execution of the MDFG is the necessary and sufficient condition for its realizability [13]. We define the space domain for schedule vectors of an MDFG as follows:

Definition 3.2

A scheduling subspace $S$ for a realizable MDFG $G = (V, E, d, t)$ is the space region where there exist schedule vectors that realize $G$, i.e., if schedule $s \in S$ then $s \cdot d(e) \ge 0$ for any $e \in E$.

An example of a scheduling subspace is shown in figure 8 for a two-dimensional case. Using such concepts, we define the basic conditions for a legal multi-dimensional retiming through the following lemma.

Lemma 3.2

Let $G = (V, E, d, t)$ be an MDFG, $r$ a multi-dimensional retiming, and $s$ a schedule vector for the retimed graph $G_r = (V, E, d_r, t)$. Then

(a) for any path $u \overset{p}{\leadsto} v$, we have $d_r(p) = d(p) + r(u) - r(v)$

(b) for any cycle $l \in G$, we have $d_r(l) = d(l)$

(c) for any edge $u \xrightarrow{e} v$, if $d_r(e) \neq (0, 0, \ldots, 0)$ then $d_r(e) \cdot s \ge 0$

3.3 The Selection of the Retiming Function

In this segment, we show how to select a legal retiming function for a realizable MDFG. Such selection is based on properties of the scheduling subspace associated with the MDFG. We begin by defining a strictly positive scheduling subspace.

Definition 3.3

Given a realizable MDFG $G = (V, E, d, t)$ with a scheduling subspace $S$, a strictly positive scheduling subspace $S^+$ is the set of all vectors $s \in S$ such that $d(e) \cdot s > 0$ for every $d(e) \neq (0, 0, \ldots, 0)$.

It is obvious that if a scheduling subspace $S$ is not empty, then the strictly positive scheduling subspace $S^+$ associated with $S$ is also not empty. Using the strictly positive scheduling subspace concept, we introduce the method of predicting a legal multi-dimensional retiming function in the next theorem.

Theorem 3.3

Let $G = (V, E, d, t)$ be a realizable MDFG, $S^+$ a strictly positive scheduling subspace for $G$, $s$ a schedule vector in $S^+$, and $u \in V$ a node with all its incoming edges having non-zero delays. A legal retiming $r(u)$ is any vector orthogonal to $s$.

Proof: From the definition of retiming, we have $d_r(e_1) = d(e_1) - r(u)$ and $d_r(e_2) = d(e_2) + r(u)$, for some $\xrightarrow{e_1} u \xrightarrow{e_2}$ and $d(e_1) \neq (0, 0, \ldots, 0)$. To verify that the resulting MDFG is realizable, we compute the inner product of $s$ and each of the retimed dependence vectors. Then, $d_r(e_1) \cdot s = d(e_1) \cdot s - r(u) \cdot s = d(e_1) \cdot s > 0$, since $e_1$ is an incoming edge and $d(e_1) \neq (0, 0, \ldots, 0)$. For $e_2$, $d_r(e_2) \cdot s = d(e_2) \cdot s + r(u) \cdot s = d(e_2) \cdot s \ge 0$. We now have two cases:

Case 1: if $d(e_2) \neq (0, 0, \ldots, 0)$ then $d(e_2) \cdot s > 0$. By lemma 3.2 the resulting graph is realizable.

Case 2: if $d(e_2) = (0, 0, \ldots, 0)$ then $d_r(e_2) = r(u)$. Since $d_r(e_1) \cdot s > 0$ and, for each non-zero-delay edge $e$, $d(e) \cdot s > 0$, it is impossible to have a linear combination of these delay vectors orthogonal to $s$, i.e., parallel to $d_r(e_2) = r(u)$. Therefore, no cycle can be found and the resulting graph is realizable. □
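The key step of the proof, that a retiming vector orthogonal to $s$ leaves every inner product $d(e) \cdot s$ unchanged, can be checked on the running example. The sketch below is restricted to two dimensions; the helper names are assumptions, and `orthogonal` returns only one of the two primitive integer directions orthogonal to $s$.

```python
from math import gcd


def dot(p, q):
    """Inner product of two 2-D integer vectors."""
    return p[0] * q[0] + p[1] * q[1]


def orthogonal(s):
    """Return a primitive integer vector r with r . s == 0 (s must be non-zero)."""
    g = gcd(s[0], s[1])
    return (s[1] // g, -s[0] // g)


def retime_edges_at_node(d_in, d_out, r):
    """Delays of one incoming and one outgoing edge of u after retiming u by r."""
    d_in_r = (d_in[0] - r[0], d_in[1] - r[1])      # d_r(e1) = d(e1) - r(u)
    d_out_r = (d_out[0] + r[0], d_out[1] + r[1])   # d_r(e2) = d(e2) + r(u)
    return d_in_r, d_out_r


s = (1, 3)
r = orthogonal(s)              # (3, -1)
r_opposite = (-r[0], -r[1])    # (-3, 1), the other orthogonal direction

# Node A of figure 5(a): incoming edge D -> A with delay (1, 0),
# outgoing zero-delay edge A -> B.
d_in_r, d_out_r = retime_edges_at_node((1, 0), (0, 0), r_opposite)
print(d_in_r, dot(s, d_in_r))    # (4, -1) 1   (same inner product as (1, 0))
print(d_out_r, dot(s, d_out_r))  # (-3, 1) 0   (case 2 of the proof)
```

The outgoing edge ends up orthogonal to the old $s$, which is why a new schedule vector has to be found before the next retiming step.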

Therefore, given a set of dependence vectors, after defining its strictly positive scheduling subspace, we can predict a legal retiming function. These results are used in our incremental retiming algorithm. Here we introduce two important corollaries of theorem 3.3.

Corollary 3.4

Given a realizable MDFG $G = (V, E, d, t)$, $S^+$ a strictly positive scheduling subspace for $G$, $s$ a schedule vector in $S^+$, and a vector $r$ orthogonal to $s$, if a set $X \subseteq V$ has all incoming edges non-zero, then $r(X)$ is a legal retiming.

Proof: For some $\xrightarrow{e_1} X \xrightarrow{e_2}$, we know that $d(e_1) \neq (0, 0, \ldots, 0)$. Considering theorem 3.3, we conclude that $r$ is a legal retiming in $G$. □

Corollary 3.5

Given a realizable MDFG $G = (V, E, d, t)$, $S^+$ a strictly positive scheduling subspace for $G$, $s$ a schedule vector in $S^+$, a vector $r$ orthogonal to $s$, a set $X \subseteq V$ with all incoming edges non-zero, and an integer value $k > 1$, then $(k\,r)(X)$ is a legal retiming.

Proof: Proved immediately from theorem 3.3 and corollary 3.4. □


3.4 The Incremental Multi-Dimensional Retiming Algorithm

The ability to predict a legal retiming for any realizable MDFG allows us to define the incremental multi-dimensional retiming algorithm using the following steps:

1. Given a realizable MDFG $G = (V, E, d, t)$, we use the idea of theorem 3.3 to find a legal retiming: we solve the inequalities $s \cdot d(e) > 0$ for every $e \in E$ with $d(e) \neq (0, 0, \ldots, 0)$, where $s$ is the unknown, and choose a retiming function from the hyperplane with $s$ as the normal vector.

2. We apply the selected retiming function to any node that has all incoming edges with non-zero delays and at least one outgoing edge with zero delay.

3. Since the resulting MDFG is still realizable, if there are zero-delay edges, go back to step 1.
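The three steps above can be sketched in Python for the two-dimensional case. This is an illustration under stated assumptions, not the authors' implementation: the schedule vector is found by a bounded brute-force search (`find_schedule` and its `bound` parameter are hypothetical), the retiming vector is always taken as `(s.y, -s.x)`, one node is retimed per iteration, and the graph has at most one edge per ordered node pair.

```python
from itertools import product

ZERO = (0, 0)


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]


def find_schedule(edges, bound=10):
    """Step 1: smallest integer s (by |s.x| + |s.y|) with s . d > 0 for all non-zero delays."""
    delays = [d for d in edges.values() if d != ZERO]
    candidates = sorted(product(range(-bound, bound + 1), repeat=2),
                        key=lambda s: (abs(s[0]) + abs(s[1]), s))
    for s in candidates:
        if s != ZERO and all(dot(s, d) > 0 for d in delays):
            return s
    return None


def retime_node(edges, node, r):
    """Push r into the outgoing edges of `node` and subtract it from the incoming ones."""
    retimed = {}
    for (u, v), d in edges.items():
        if u == node and v != node:
            d = (d[0] + r[0], d[1] + r[1])
        elif v == node and u != node:
            d = (d[0] - r[0], d[1] - r[1])
        retimed[(u, v)] = d
    return retimed


def incremental_retiming(edges):
    """Repeat steps 1 to 3 until no zero-delay edge is left."""
    nodes = sorted({n for edge in edges for n in edge})
    steps = []
    while any(d == ZERO for d in edges.values()):
        s = find_schedule(edges)
        if s is None:
            raise ValueError("no schedule vector found within the search bound")
        r = (s[1], -s[0])  # one of the two integer directions orthogonal to s
        for u in nodes:
            incoming = [d for (_, v), d in edges.items() if v == u]
            outgoing = [d for (w, _), d in edges.items() if w == u]
            # Step 2: all incoming edges non-zero, at least one outgoing edge zero.
            if all(d != ZERO for d in incoming) and any(d == ZERO for d in outgoing):
                edges = retime_node(edges, u, r)
                steps.append((u, s, r))
                break
        else:
            raise ValueError("no eligible node; the graph has a zero-delay cycle")
    return edges, steps


FIG3 = {
    ("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1), ("C", "D"): (1, 1),
}
final_edges, steps = incremental_retiming(FIG3)
print(steps)  # [('D', (0, 1), (1, 0)), ('A', (1, 3), (3, -1))]
```

On the graph of figure 3(a) the sketch selects the same schedule vectors as the worked example, $s = (0,1)$ and then $s = (1,3)$. In the second step it retimes node A by $(3,-1)$ rather than $(-3,1)$; both vectors are orthogonal to $s$, so both are covered by theorem 3.3.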

As a consequence of the concepts presented, we state the main theorem as follows:

Theorem 3.6

Let $G = (V, E, d, t)$ be a realizable MDFG. The incremental multi-dimensional retiming algorithm transforms $G$ into $G_r$ in at most $|V|$ iterations, such that $G_r$ is fully parallel.

Proof: After an iteration of the incremental multi-dimensional retiming algorithm, the resulting MDFG is still realizable. Successive iterations of finding $s$ and the respective incremental retiming function allow us to change all zero-delay edges into non-zero ones, obtaining a fully parallel MDFG. After each iteration, all outgoing edges of at least one new node have non-zero delays. After at most $|V|$ iterations, full parallelism is achieved. □

Let us examine the example in figure 3. The first incremental retiming is shown in figure 5, where $s$ has value $(0,1)$. The next retiming must consider the new set of dependencies $\{(-2,1), (0,1), (1,0)\}$. We choose $s = (1,3)$; the retiming function for node $A$ can then be either $(-3,1)$ or $(3,-1)$. Choosing $r(A) = (-3,1)$, we obtain the graph in figure 9(a). The equivalent Fortran code is not a simple representation of the graph because it requires a recursion schedule vector [13], i.e., a schedule vector parallel to one of the axes in the index space. We use loop skewing [24, 25] to adjust the schedule vector. The code equivalent to the final retimed graph is shown in figure 9(b).

4 Chained Multi-Dimensional Retiming

The previous section described a method to find a fully parallel solution in several iterations of an incremental retiming operation. To increase the efficiency of our algorithm, we now introduce the concepts that allow us to obtain the fully parallel solution in a single pass.

Definition 4.1

A chain is a path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} v_2 \cdots \xrightarrow{e_{k-1}} v_k$, where all incoming edges of node $v_0$ are non-zero delay edges and there exists at least one zero-delay edge preceding the nodes $v_i$, $1 \le i \le k$. The nodes are numbered in monotonically increasing order, and $k$ is said to be the length of the chain.

(a) [MDFG drawing not reproduced: delay (-3,1) on edges A -> B and A -> C, (4,-1) on edge D -> A, (-2,1) on edge B -> D, and (0,1) on edge C -> D]

(b)

      DO 10 jp = 3, 4*n+m-1
C     --- prologue ---
      DO 11 kp = max(3,jp-n), min(3*n+m-1,jp-3)
      d(4*kp-3*jp+1,jp-kp) = b(4*kp-3*jp+2,jp-kp-1) * c(4*kp-3*jp,jp-kp-1)
      a(4*kp-3*jp-3,jp-kp+1)= d(4*kp-3*jp-3,jp-kp+1) * .5
      b(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 1.
      c(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 2.
   11 CONTINUE
C     --- epilogue ---
   10 CONTINUE

Figure 9: (a) MDFG fully parallel (b) equivalent Fortran code after loop skewing

From the definition above we derive a method for identifying all chains belonging to an MDFG. We call this construction a multi-chain structure, and it is built according to the following steps:

1. All non-zero delay edges are removed from the MDFG. Since a realizable MDFG has no zero-delay cycle, the result is a directed acyclic graph (DAG).

2. A modified topological sort algorithm [22] is used to order the nodes in levels from the end to the beginning (see footnote 2). Each node is labeled according to its level number, which produces the monotonically increasing characteristic of the node indices in a chain.

We call the highest level number obtained in the construction process described above the multi-chain maximum length. An example of a chain can be obtained from figure 3(a). The existing chains are $D \to A \to B$ and $D \to A \to C$. Node $D$ is at level 0, node $A$ at level 1, and nodes $B$ and $C$ at level 2. Those chains have a multi-chain maximum length of 2. To speed up the incremental retiming process, we introduce the idea of a chained retiming according to the new algorithm below:

1. Find a legal retiming function $r$ as in the first step of the incremental retiming algorithm.

2. Construct the multi-chain structure, labeling the nodes accordingly and computing the multi-chain maximum length $k$.

3. Retime each node $v_i$ by $(k - i)\,r$. The result is the fully parallel MDFG.
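A minimal Python sketch of steps 2 and 3 follows, taking the retiming vector $r$ from step 1 as an input. Levels are assigned from the end of the chains: a node whose longest zero-delay path to a sink has $h$ edges receives level $k - h$. This is one way to reproduce the level numbers of the example, not necessarily the modified topological sort of [22]. The function name and the dictionary representation are assumptions.

```python
def chained_retiming(edges, r):
    """Return (level, retiming, retimed edges) for a given legal retiming vector r.

    edges maps (u, v) to a delay vector. Assumes the zero-delay edges
    form a DAG, as in any realizable MDFG.
    """
    zero = (0,) * len(r)
    zero_succ = {}
    nodes = set()
    for (u, v), d in edges.items():
        nodes.update((u, v))
        if d == zero:
            zero_succ.setdefault(u, []).append(v)

    height = {}

    def longest_zero_path(v):
        # Number of edges on the longest zero-delay path starting at v.
        if v not in height:
            below = (longest_zero_path(w) for w in zero_succ.get(v, []))
            height[v] = 1 + max(below, default=-1)
        return height[v]

    k = max(longest_zero_path(v) for v in nodes)   # multi-chain maximum length
    level = {v: k - height[v] for v in nodes}
    retiming = {v: tuple((k - level[v]) * c for c in r) for v in nodes}
    retimed = {
        (u, v): tuple(x + a - b for x, a, b in zip(d, retiming[u], retiming[v]))
        for (u, v), d in edges.items()
    }
    return level, retiming, retimed


FIG3 = {
    ("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1), ("C", "D"): (1, 1),
}
level, retiming, fig10 = chained_retiming(FIG3, (1, 0))
print(retiming["D"], retiming["A"])  # (2, 0) (1, 0)
```

For the graph of figure 3(a) and $r = (1,0)$, the sketch places D at level 0, A at level 1, and B and C at level 2, and the retimed delays are $(1,0)$ on the three formerly zero-delay edges, $(-3,1)$ on B -> D, and $(-1,1)$ on C -> D.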

Footnote 2: Every node is assigned a unique level number, even if it belongs to multiple chains.

(a) [MDFG drawing not reproduced: delay (1,0) on edges A -> B, A -> C and D -> A, (-3,1) on edge B -> D, and (-1,1) on edge C -> D]

(b)

      DO 10 j = 0, n
C     prologue
      d(0,j) = b(1,j-1) * c(-1,j-1)
      d(1,j) = b(2,j-1) * c(0,j-1)
      a(0,j) = d(0,j) * .5
      DO 11 k = 0, m-2
      d(k+2,j) = b(k+3,j-1) * c(k+1,j-1)
      a(k+1,j) = d(k+1,j) * .5
      b(k,j) = a(k,j) + 1.
      c(k,j) = a(k,j) + 2.
   11 CONTINUE
C     epilogue
      a(m,j) = d(m,j) * .5
      b(m-1,j) = a(m-1,j) + 1.
      c(m-1,j) = a(m-1,j) + 2.
      b(m,j) = a(m,j) + 1.
      c(m,j) = a(m,j) + 2.
   10 CONTINUE

Figure 10: (a) MDFG fully parallel using chained retiming (b) equivalent Fortran code

For this algorithm, only one step of incremental retiming is executed at the beginning. The rest of the algorithm requires only $O(|E|)$ time. Therefore, this algorithm is more efficient than incremental retiming. The next theorem proves that the MDFG obtained from the chained retiming algorithm is realizable.

Theorem 4.1

Let $G = (V, E, d, t)$ be a realizable MDFG. The chained multi-dimensional retiming algorithm transforms $G$ into $G_r$, such that $G_r$ is realizable and fully parallel.

Proof: By using the chained multi-dimensional retiming algorithm, we find $r$, an incremental retiming for $G$; $k$, the multi-chain maximum length of $G$; and $i$, the level number assigned to each node $v \in V$. The retiming of each node $v_i$ by $(k - i)\,r$ results in final dependence vectors of the form $d(e) - j\,r$, $d(e) + j\,r$, or $j\,r$, where $1 \le j \le k - 1$. By corollary 3.5, the retimed MDFG is realizable, and since there are no zero-delay edges, $G_r$ is fully parallel. □

Revisiting the example in figure 3, we find that for $k = 2$ (the length of the existing chains) and $r = (1,0)$ (the first incremental retiming function) we could have applied only one retiming step, in which node $D$ is retimed by $2(1,0)$, i.e., $(2,0)$, and node $A$ is retimed by $1(1,0)$. The final graph and Fortran code are shown in figure 10.

5 Experiments

In this section we present the application of our method to two different examples. Our first example is the IIR filter introduced in section 1. The second example is an MDFG representing a wave digital filter. This MDFG computes the solution of a partial differential equation problem (a transmission line problem), based on the Fettweis method [9].

[MDFG drawings not reproduced.]

Figure 11: (a) MDFG after retiming A6 (b) fully parallel MDFG after retiming A5

IIR Filter

We now revisit the example shown in figure 1. Assume both adders and multipliers take one time unit to compute. Using the incremental retiming concept, we begin by retiming all nodes labeled $M$ by $r = (-1,1)$ for a schedule vector $s = (1,1)$. Figure 2(a) shows the resulting MDFG. The new set of dependencies $D$ is now given by:

$$D = \begin{bmatrix} 3 & 3 & 3 & 2 & 2 & 2 & 1 & 1 & -1 \\ 1 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & 1 \end{bmatrix}$$

After retiming nodes $A1$ to $A4$ by $r = (-3,2)$, the graph presented in figure 2(b) is produced. In the next step, we retime node $A6$ by $r = (-7,4)$ and obtain the graph shown in figure 11(a). The final graph, presented in figure 11(b), is obtained after retiming node $A5$ by $r = (-15,8)$.

Our second approach is to use the chained retiming algorithm. We find the topological ordering of the DAG associated with the graph in figure 1. Nodes $M5$ to $M8$ are assigned to level 0; nodes $M3$, $M4$, $A3$, and $A4$ to level 1; $M1$, $M2$, $A2$, and $A6$ to level 2; $A1$ and $A5$ to level 3; and finally, $A7$ to level 4. Therefore the multi-chain maximum length of $G$ is 4. We know that the first incremental retiming function for this set of dependencies is $(-1,1)$. Then we may apply the following retiming functions to the graph: $r(M_i) = (-4,4)$ for $i = 5$ to $8$; $r(M3) = r(M4) = r(A3) = r(A4) = (-3,3)$; $r(M1) = r(M2) = r(A2) = r(A6) = (-2,2)$; and $r(A1) = r(A5) = (-1,1)$. The final fully parallel graph is shown in figure 12.

Transmission Line Simulation

[MDFG drawing not reproduced.]

Figure 12: Fully parallel MDFG obtained through chained retiming

In this example, we introduce a two-dimensional transmission line problem, initially transformed by using the Fettweis method (see [9] for a tutorial). The transmission line is characterized by the equations below:

$$l\,\frac{\partial i_1}{\partial t_2} + r\,i_1 + \frac{\partial u}{\partial t_1} = f_1$$

$$\frac{\partial i_1}{\partial t_1} + c\,\frac{\partial u}{\partial t_2} + g\,u = f_2$$

After applying the Fettweis transformations we obtain the Wave Digital Filter shown in figure 13(a), which is equivalent to the MDFG in figure 13(b), where the inputs $e_1$ and $e_2$ are zero.

The two-port adaptors, A and B, are expanded to their internal configuration. For simplicity, one output port and one adder in each adaptor were deleted because of the particular boundary conditions applied. Using the incremental algorithm we obtain the graph shown in figure 14(a), after incrementally applying the following retiming functions: $r(F) = r(E) = (1,0)$, $r(H) = r(G) = (-3,1)$, $r(B1) = r(A1) = (-7,2)$, $r(B2) = r(A2) = (-15,4)$, and $r(B3) = r(A3) = (-31,8)$. Figure 14(b) shows the result of applying the chained retiming method to this case. The retiming functions were $r(F) = r(E) = (5,0)$, $r(H) = r(G) = (4,0)$, $r(B1) = r(A1) = (3,0)$, $r(B2) = r(A2) = (2,0)$, and $r(B3) = r(A3) = (1,0)$.

6 Conclusion

We have presented two novel techniques for optimizing a multi-dimensional data flow graph through the use of a multi-dimensional retiming method that we developed. We call these

[Drawings not reproduced.]

Figure 13: (a) Wave Digital Filter graph (b) equivalent MDFG

[Drawings not reproduced.]

Figure 14: Final MDFG for the transmission line problem, (a) using incremental retiming (b) using chained retiming
