
Nested Loop Transformation for Full Parallelism

Nelson Luiz Passos and Edwin Hsing-Mean Sha

Research Report CSE-TR-94-011

April 1994

Abstract

Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable tool in one-dimensional problems. Most scientific or DSP applications are recursive or iterative. Uniform nested loops can be modeled as multi-dimensional data flow graphs (DFGs). Achieving full parallelism of the loop body, i.e., executing all the computational nodes in parallel, substantially decreases the overall computation time. It is well known that for one-dimensional DFGs retiming cannot always achieve full parallelism. This paper shows an important and counter-intuitive result: full parallelism can always be obtained for DFGs with more than one dimension. It also presents two novel multi-dimensional retiming techniques to obtain full parallelism. Examples, descriptions, and the correctness of our algorithms are presented in the paper.


1 Introduction

Applications such as image processing, fluid mechanics, and weather forecasting require high computer performance. Researchers and designers in those areas are looking for solutions to multi-dimensional problems through the use of parallel computers and/or specialized hardware, which justifies the study of multi-dimensional optimization algorithms. Efficient transformation mechanisms for parallel and/or pipelined processing of one-dimensional problems represented by Data Flow Graphs have been proposed in previous studies using retiming techniques [16]. The retiming technique regroups operations in iterations in order to produce a new iteration structure with higher embedded parallelism [4, 21]. An equivalent approach to the multi-dimensional case will benefit parallel compiler design for VLIW, data-flow, and superscalar architectures [1, 10, 14], as well as high-level synthesis for VLSI design [3, 18].

Computation-intensive applications usually depend on time-critical sections consisting of a loop of instructions. To optimize the execution rate of such applications, the designer needs to explore the parallelism embedded in the repetitive patterns of a loop. Some research has been done on uniform nested loop scheduling, which takes a similar view of a multi-dimensional problem; examples are unimodular transformations [25], loop skewing [24], and loop quantization [2]. These techniques differ from our method because they do not change the structure of the iterations, only the sequence in which the instructions are executed.

Other techniques that search beyond loop boundaries include perfect pipelining [1] and Doacross loops [7]. Perfect pipelining looks for a repeating pattern in a single loop, with the disadvantage that both the size of the pattern and the performance gain are unpredictable. A Doacross loop overlaps consecutive iterations, which contributes to the improvement of the execution rate. These two methods can be regarded as special cases of our technique.

In our study, loop bodies are represented by multi-dimensional data flow graphs (MDFGs). Instead of simply overlapping iterations, we restructure the loop body, i.e., the existing dependence delays, preserving the original data dependence. We model the process of restructuring as a multi-dimensional retiming. The new MDFG is constructed from the original graph through the retiming function. The retiming process and loop pipelining are, in essence, the same concept [5]. Previous research in loop pipelining for loops with cyclic dependencies produced methods that focus on one-dimensional problems. Such methods appear in several systems [11, 15, 20, 23].

In the one-dimensional retiming method, negative delays represent a cycle in the execution sequence and cannot be allowed. In the multi-dimensional case, however, negative delays are natural as long as there exists a schedule s that makes the graph realizable. Chao and Sha introduced the concept of a restricted multi-dimensional retiming [4]. Their algorithm cannot guarantee full parallelism and is only applicable to a specific class of MDFGs, in which any cycle has a strictly non-negative total multi-dimensional delay. In this paper, two new algorithms are presented: incremental multi-dimensional retiming and chained multi-dimensional retiming. The first uses a legal retiming to successively restructure the loop body represented by a general form of MDFG. The second improves the efficiency of the first by obtaining the final solution in only one pass. Both techniques achieve full parallelism.

[Figure 1(a): MDFG with multiplier nodes M1 to M8, adder nodes A1 to A7, and edge delays (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2).]

          DO 10 n1=1,N1
          DO 20 n2=1,N2
          y(n1,n2) = x(n1,n2) + c(0,1)*y(n1,n2-1) +
         -  c(0,2)*y(n1,n2-2) + c(1,0)*y(n1-1,n2) +
         -  c(1,1)*y(n1-1,n2-1) + c(1,2)*y(n1-1,n2-2) +
         -  c(2,0)*y(n1-2,n2) + c(2,1)*y(n1-2,n2-1) +
         -  c(2,2)*y(n1-2,n2-2)
       20 CONTINUE
       10 CONTINUE

Figure 1: (a) MDFG representing an IIR filter (b) equivalent Fortran code

Full parallelism is the simultaneous execution of all operations (nodes) in an MDFG. Therefore, obtaining full parallelism is equivalent to obtaining a non-zero delay on every edge of the MDFG. In the one-dimensional case, such a condition cannot always be achieved through retiming, because the sum of the delays in a cycle is constant.

For simplicity, we use two-dimensional problems without instructions between loops as examples. The two dimensions are generically referred to as x and y. The multi-dimensional case is a straightforward extension of the concepts presented. The main example consists of an IIR filter (Infinite Extent Impulse Response) [8], represented by the transfer function:

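$$ H(z_1, z_2) = \frac{1}{1 - \sum_{n_1=0}^{2} \sum_{n_2=0}^{2} c(n_1, n_2)\, z_1^{-n_1} z_2^{-n_2}} $$

which can be translated into

$$ y(n_1, n_2) = x(n_1, n_2) + \sum_{k_1=0}^{2} \sum_{k_2=0}^{2} c(k_1, k_2)\, y(n_1 - k_1, n_2 - k_2), \qquad (k_1, k_2) \neq (0, 0). $$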
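The recursion above can be evaluated directly, as a minimal Python sketch shows. The function name `iir_filter_2d`, the zero boundary condition outside the grid, and the coefficient values are assumptions made for illustration; they are not taken from the report.

```python
def iir_filter_2d(x, c):
    """Evaluate y(n1,n2) = x(n1,n2) + sum of c(k1,k2) * y(n1-k1, n2-k2)."""
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for n1 in range(rows):
        for n2 in range(cols):
            acc = x[n1][n2]
            for k1 in range(3):
                for k2 in range(3):
                    if (k1, k2) == (0, 0):
                        continue  # the output sample does not feed itself
                    if n1 - k1 >= 0 and n2 - k2 >= 0:
                        # Assumption: samples outside the grid are zero.
                        acc += c[k1][k2] * y[n1 - k1][n2 - k2]
            y[n1][n2] = acc
    return y


# Hypothetical coefficients: only c(0,1) is non-zero.
c = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
response = iir_filter_2d([[1.0, 0.0, 0.0]], c)  # [[1.0, 0.5, 0.25]]
```

Every output sample depends on previously computed outputs, which is the source of the delay vectors (k1, k2) on the edges of the MDFG.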

The MDFG derived from the equation above is shown in figure 1(a). An equivalent Fortran code is presented in figure 1(b). The retimed graph is presented in figure 2(a), where the two-dimensional delay (−1,1) is pushed through all nodes labeled M in the original MDFG. Intuitively, the retimed nodes no longer precede the additions A1 to A4 within the same iteration, which allows those operations to be executed simultaneously. This reduces the critical path of the graph and, consequently, the overall execution time. Figure 2(b) shows the result of a further retiming operation, in which the two-dimensional delay (−3,2) was pushed through nodes A1 to A4, improving the parallelism and reducing the critical path even further. Full parallelism is achieved when all edges have non-zero delays.

[Figure 2(a): the MDFG of figure 1(a) with edge delays (1,0), (1,1), (2,−1), (2,0), (2,1), (3,−1), (3,0), (3,1), and (−1,1) on the outgoing edges of the M nodes. Figure 2(b): the same graph with delays (2,−1) on the outgoing edges of the M nodes and (−3,2) on the outgoing edges of A1 to A4.]

Figure 2: (a) MDFG after retiming all nodes M (b) MDFG after retiming A1 to A4

In section 2, we present some basic concepts, such as mathematical models for dependence graphs and data flow graphs, along with the basic ideas of multi-dimensional retiming and cycle period. These concepts are used in the following section to develop the algorithm for incremental retiming. Section 3 shows the properties necessary to have a legal retiming function. The incremental multi-dimensional retiming algorithm is then described, giving us the foundation required for the main theorem, which proves that full parallelism can always be achieved. Section 4 presents a more efficient algorithm that uses the concept of a chain reaction. Section 5 shows some examples of the application of our techniques, followed by a conclusion that summarizes the concepts introduced in this paper.

2 Basic Principles

2.1 Background

In this section we present some concepts related to the interpretation of multi-dimensional data flow graphs, such as mathematical models, dependence vectors, iterations, and vector operations.

A multi-dimensional data flow graph (MDFG) G = (V, E, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E ⊆ V × V is the set of dependence edges, d is a function from E to Z^n representing the multi-dimensional delay between two nodes, where n is the number of dimensions, and t is a function from V to the positive integers, representing the computation time of each node. A two-dimensional data flow graph (2DFG) G = (V, E, d, t) is an MDFG where d is a function from E to Z^2. We use d(e) = (d.x, d.y) as a general formulation of any delay shown in a two-dimensional DFG.

[Figure 3(a): MDFG with nodes A, B, C, and D and the non-zero edge delays (−1,1) and (1,1).]

          DO 10 j = 0,n
          DO 11 k = 0, m
    D:      d(k,j) = b(k+1,j-1) * c(k-1,j-1)
    A:      a(k,j) = d(k,j) * .5
    B:      b(k,j) = a(k,j) + 1.
    C:      c(k,j) = a(k,j) + 2.
       11 CONTINUE
       10 CONTINUE

Figure 3: (a) MDFG extracted from a Wave Digital Filter (b) equivalent Fortran code

An example of a two-dimensional DFG and its equivalent Fortran code is shown in figure 3. For this example, V = {A, B, C, D} and E = {e1 : (A,B), e2 : (A,C), e3 : (D,A), e4 : (B,D), e5 : (C,D)}, where d(e1) = d(e2) = d(e3) = (0,0), d(e4) = (−1,1), and d(e5) = (1,1). For simplicity, each operation is assumed to be executed in one time unit; therefore, t(A) = t(B) = t(C) = t(D) = 1.
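As an illustration, the example graph can be written down as plain Python data. The dictionary layout and the helper names `intra_iteration_edges` and `inter_iteration_edges` are hypothetical choices for this sketch, not part of the report.

```python
# MDFG of figure 3(a): nodes V, edges E with delay vectors d(e), and times t.
V = ["A", "B", "C", "D"]
E = {
    ("A", "B"): (0, 0),   # e1
    ("A", "C"): (0, 0),   # e2
    ("D", "A"): (0, 0),   # e3
    ("B", "D"): (-1, 1),  # e4
    ("C", "D"): (1, 1),   # e5
}
t = {v: 1 for v in V}


def intra_iteration_edges(edges):
    """Edges with delay (0,...,0): dependencies inside one iteration."""
    return [e for e, d in edges.items() if not any(d)]


def inter_iteration_edges(edges):
    """Edges with a non-zero delay: dependencies between iterations."""
    return [e for e, d in edges.items() if any(d)]
```

Here `intra_iteration_edges(E)` returns the three zero-delay edges e1, e2, and e3, which are the ones that limit parallelism inside the loop body.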

An iteration is the execution of the loop body exactly once, i.e., the execution of each node in V exactly once. Iterations are identified by a vector i, equivalent to a multi-dimensional index, starting from (0,0,...,0). Inter-iteration dependencies are represented by vector-weighted edges. For any iteration j, an edge e from u to v with delay vector d(e) means that the computation of node v at iteration j depends on the execution of node u at iteration j − d(e). An edge with delay (0,0,...,0) represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e., the summation of the delay vectors along any cycle cannot be (0,0,...,0). Several techniques are available to verify that an MDFG does not have such a cycle [6, 12].

An equivalent cell dependence graph of an MDFG G is the directed acyclic graph showing the dependencies between copies of nodes representing the MDFG. Figure 4(a) shows the replication of the MDFG in figure 3(a), and figure 4(b) shows the cell dependence graph with each node representing a copy of the MDFG. A computational cell is the cell dependence graph (DG) node that represents a copy of an MDFG, excluding the edges with delay vectors different from (0,0,...,0), i.e., a complete iteration. A cell is considered an atomic execution unit.

A mathematical model for a cell dependence graph expanded from an MDFG is presented in [17]: a tuple (I^n, D), where I^n is the index set of the cell structure, equivalent to the vectors used to identify each iteration, and D is a matrix containing all dependence vectors (delay vectors). Consider, for example, the DG shown in figure 4; its model is described as:

$$ I^2 = \{(i, j) : 0 \le i \le 3,\ 0 \le j \le 3\}, \qquad D = \begin{bmatrix} -1 & 1 \\ 1 & 1 \end{bmatrix} $$

[Figure 4(a): copies of the nodes A, B, C, and D for iterations (0,0) to (3,1). Figure 4(b): grid of computational cells (0,0) to (3,3).]

Figure 4: (a) DG based on the replication of an MDFG, showing iterations starting at (0,0). (b) DG represented by computational cells

Uniform loops are those that present the characteristic of constant dependence vectors, i.e., data dependencies are at a constant distance in the iteration space.

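The notation $u \xrightarrow{e} v$ means that e is an edge from node u to node v. The notation $u \stackrel{p}{\leadsto} v$ means that p is a path from u to v. The delay vector of a path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} v_2 \cdots \xrightarrow{e_{k-1}} v_k$ is $d(p) = \sum_{i=0}^{k-1} d(e_i)$, and the total computation time of the path p is $\sum_{i=0}^{k} t(v_i)$.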

To manipulate MDFG characteristics represented in vector notation, such as the delay vectors, we make use of component-wise vector operations. Considering two two-dimensional vectors P and Q, represented by their coordinates (P.x, P.y) and (Q.x, Q.y), an example of an arithmetic operation is P + Q = (P.x + Q.x, P.y + Q.y). The notation P · Q indicates the inner product between P and Q, i.e., P · Q = P.x Q.x + P.y Q.y.
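These operations are small enough to sketch in a few lines of Python. The helper names (`vadd`, `dot`, `path_delay`, `path_time`) are illustrative assumptions; the example path B → D → A is taken from the graph of figure 3(a).

```python
def vadd(p, q):
    """Component-wise sum P + Q."""
    return tuple(a + b for a, b in zip(p, q))


def dot(p, q):
    """Inner product P . Q."""
    return sum(a * b for a, b in zip(p, q))


def path_delay(edge_delays):
    """d(p): sum of the delay vectors of the edges along a path."""
    total = (0,) * len(edge_delays[0])
    for d in edge_delays:
        total = vadd(total, d)
    return total


def path_time(node_times):
    """Total computation time: sum of t(v) over all nodes on the path."""
    return sum(node_times)


# Path B -> D -> A in figure 3(a): edges (B,D) and (D,A), three unit-time nodes.
delay = path_delay([(-1, 1), (0, 0)])  # (-1, 1)
time = path_time([1, 1, 1])            # 3
```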

A schedule vector s is the normal vector for a set of parallel equitemporal hyperplanes that define a sequence of execution of a cell dependence graph. The existence of a schedule vector prevents the existence of a cycle. We say that an MDFG G = (V, E, d, t) is realizable if there exists a schedule vector s for the cell dependence graph with respect to G, i.e., s · d ≥ 0 for any d ∈ G [13].
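The condition s · d ≥ 0 can be checked mechanically. The sketch below applies it to the two dependence vectors of the example DG, (−1,1) and (1,1); the function name `is_schedule_vector` is a hypothetical choice, and the check only tests the stated inequality for one candidate s.

```python
def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def is_schedule_vector(s, dependence_vectors):
    """True if s . d >= 0 holds for every dependence vector d."""
    return all(dot(s, d) >= 0 for d in dependence_vectors)


D = [(-1, 1), (1, 1)]               # dependence vectors of the example DG
ok = is_schedule_vector((0, 1), D)   # True: both products equal 1
bad = is_schedule_vector((1, 0), D)  # False: (1,0) . (-1,1) = -1
```

With s = (0,1), all cells in one row of the DG can run at the same time step, and rows execute in increasing order of the second index.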

[Figure 5(a): the MDFG of figure 3(a) with edge delays (−2,1), (0,1), and (1,0).]

          DO 10 j = 0, n
    C     prologue
          d(0,j) = b(1,j-1) * c(-1,j-1)
          DO 11 k = 0, m-1
            d(k+1,j) = b(k+2,j-1) * c(k,j-1)
            a(k,j) = d(k,j) * .5
            b(k,j) = a(k,j) + 1.
            c(k,j) = a(k,j) + 2.
       11 CONTINUE
    C     epilogue
          a(m,j) = d(m,j) * .5
          b(m,j) = a(m,j) + 1.
          c(m,j) = a(m,j) + 2.
       10 CONTINUE

Figure 5: (a) MDFG after retimed by r(D)=(1,0) (b) equivalent Fortran code

2.2 Retiming a Multi-Dimensional Data Flow Graph

In this subsection, we show the basic ideas on multi-dimensional retiming, including cycle period, critical path, and characteristics of legal retiming.

The period during which all computation nodes in an iteration are executed, according to existing data dependencies and without resource constraints, is called a cycle period. The cycle period C(G) of an MDFG G = (V, E, d, t) is the maximum computational time among paths that have no delay. For example, the MDFG in figure 3(a) has C(G) = 3, which can be measured through the paths p = D → A → B or p = D → A → C.
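Since a legal MDFG has no zero-delay cycle, the cycle period can be computed as a longest path over the zero-delay edges. The following Python sketch does this by memoized recursion; the function name `cycle_period` and the dictionary encoding of the graph are assumptions for illustration, and the code does not terminate on a graph that contains a zero-delay cycle.

```python
from functools import lru_cache


def cycle_period(nodes, edges, t):
    """C(G): maximum total node time over paths made of zero-delay edges."""
    succ = {v: [] for v in nodes}
    for (u, v), d in edges.items():
        if not any(d):              # keep only zero-delay edges
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        return t[v] + max((longest_from(w) for w in succ[v]), default=0)

    return max(longest_from(v) for v in nodes)


V = ["A", "B", "C", "D"]
t = {v: 1 for v in V}
FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
FIG5 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (1, 0),
        ("B", "D"): (-2, 1), ("C", "D"): (0, 1)}

c3 = cycle_period(V, FIG3, t)  # 3, through D -> A -> B
c5 = cycle_period(V, FIG5, t)  # 2, after retiming D by (1,0)
```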

A multi-dimensional retiming r is a function from V to Z^n that redistributes the nodes in the original dependence graph created by the replication of an MDFG G. A new MDFG G_r is created, such that each iteration still has one execution of each node in G. The retiming vector r(u) of a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. The delay vectors change accordingly to preserve dependencies, i.e., r(u) represents delay components pushed into the edges u → v and subtracted from the edges w → u, where u, v, w ∈ G. Therefore, we have d_r(e) = d(e) + r(u) − r(v) for every edge $u \xrightarrow{e} v$, and d_r(l) = d(l) for every cycle l ∈ G. After retiming, the execution of node u in iteration i is moved to iteration i − r(u). A two-dimensional retiming r is a function from V to Z^2 that redistributes the nodes in the original dependence graph created by the replication of a 2DFG G, resulting in a new 2DFG G_r, such that each iteration still has one execution of each node in G, reducing the cycle period as in the one-dimensional case. For example, figure 5(a) shows the MDFG from figure 3(a) retimed by the function r with r(D) = (1,0). Figure 5(b) shows the modified Fortran code. The critical paths of this graph are the edges A → B and A → C, with an execution time of 2 time units.
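The rule d_r(e) = d(e) + r(u) − r(v) translates directly into code. The sketch below applies r(D) = (1,0) to the graph of figure 3(a) and reproduces the delays of figure 5(a); the function name `retime` and the convention that unlisted nodes have a zero retiming vector are assumptions of this example.

```python
def retime(edges, r):
    """Return the delays d_r(e) = d(e) + r(u) - r(v) for every edge u -> v.

    Nodes missing from r are treated as having a zero retiming vector.
    """
    zero = (0,) * len(next(iter(edges.values())))
    retimed = {}
    for (u, v), d in edges.items():
        ru, rv = r.get(u, zero), r.get(v, zero)
        retimed[(u, v)] = tuple(x + a - b for x, a, b in zip(d, ru, rv))
    return retimed


FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}

fig5 = retime(FIG3, {"D": (1, 0)})
# D -> A becomes (1,0), B -> D becomes (-2,1), C -> D becomes (0,1).
```

The cycle D → A → B → D has total delay (−1,1) both before and after the retiming, in line with d_r(l) = d(l).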

The retimed cell DG for the example in figure 3(a) is shown in figures 6(a) and (b), where the nodes originally belonging to iteration (0,0) are marked. A possible schedule vector for the retimed graph is s = (1,3). Figure 7(a) shows an illegal retiming function applied to the same example. By simple inspection of the cell dependence graph in figure 7(b), we notice the existence of a cycle created by the dependencies (1,0) and (−1,0).

[Figure 6(a): copies of the nodes A, B, C, and D after retiming, with the prologue marked. Figure 6(b): grid of computational cells (0,0) to (3,3) with schedule vector s = (1,3).]

Figure 6: (a) DG based on the replication of an MDFG, after retiming. (b) same DG represented by computational cells.

[Figure 7(a): the MDFG retimed by r(D) = (0,1), r(A) = r(B) = r(C) = (0,0), with edge delays (−1,0), (1,0), and (0,1). Figure 7(b): grid of cells (0,0) to (3,2).]

Figure 7: (a) Example of illegal retiming. (b) DG showing cycles in the x-direction.
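The cycle in the illegal example can be found by looking for two dependence vectors that cancel each other. The Python sketch below does that for r(D) = (0,1); the helper names are hypothetical, and the test is only a sufficient symptom of an illegal retiming (longer combinations of vectors are not examined).

```python
def retime(edges, r):
    """d_r(e) = d(e) + r(u) - r(v); unlisted nodes have a zero retiming."""
    zero = (0,) * len(next(iter(edges.values())))
    return {(u, v): tuple(x + a - b
                          for x, a, b in zip(d, r.get(u, zero), r.get(v, zero)))
            for (u, v), d in edges.items()}


def cancelling_pairs(delays):
    """Pairs of non-zero dependence vectors that sum to (0,...,0)."""
    pairs = []
    for i, d in enumerate(delays):
        for e in delays[i + 1:]:
            if any(d) and all(a + b == 0 for a, b in zip(d, e)):
                pairs.append((d, e))
    return pairs


FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}

illegal = retime(FIG3, {"D": (0, 1)})
legal = retime(FIG3, {"D": (1, 0)})
bad_pairs = cancelling_pairs(list(illegal.values()))  # [((-1, 0), (1, 0))]
good_pairs = cancelling_pairs(list(legal.values()))   # []
```

A cell that depends on its neighbor at offset (1,0) while that neighbor depends on it at offset (−1,0) can never start, so no schedule vector exists for the illegal graph.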

A prologue is the set of instructions that are moved in the x and y directions by a two-dimensional retiming and that must be executed first to provide the data needed by the iterative process. In our example, shown in figure 6(a), the instruction D becomes the prologue. The epilogue is at the other extreme of the DG, where a complementary set of instructions is executed to complete the process. In the example, the reduction of the critical path, as seen in figure 5, results in a saving of approximately one third of the total execution time.


3 Incremental Multi-Dimensional Retiming

3.1 Basic Concepts

The technique of multi-dimensional retiming allows changes in more than one direction [4, 19]. Our method uses this property to achieve full parallelism by incrementally applying the retiming function to multi-dimensional problems. The properties, algorithm, and supporting theorems for the incremental multi-dimensional retiming are presented below. We begin by defining a legal retiming function.

Definition 3.1

Given a realizable MDFG G = (V, E, d, t), a legal retiming for G is a multi-dimensional retiming r that transforms G into G_r such that G_r is still realizable.

Finding a legal retiming of an MDFG such that all its nodes can be executed in parallel is equivalent to obtaining a cycle period equal to one, when we assume the execution time of each node to be one time unit. This problem is more complex than traditional retiming in the one-dimensional case, where positive delays after the transformation guarantee a legal retiming. For the multi-dimensional problem, however, requiring positive delay vectors is too restrictive, since a dependence graph can still be realizable even if it has negative delays. The following theorem therefore gives a more general set of constraints for a legal retiming that produces full parallelism among the nodes of an MDFG.

Theorem 3.1

Let G = (V, E, d, t) be an MDFG with t(v) = 1 for any v ∈ V. Then r is a legal multi-dimensional retiming for G such that C(G_r) = 1 if and only if:

1. the cell dependence graph of the retimed MDFG G_r = (V, E, d_r, t) does not contain any cycle;

2. d_r(e) ≠ (0,0,...,0) for any e ∈ E;

3. if the execution time of a path p in G_r is greater than one, then d_r(p) ≠ (0,0,...,0).

Proof: The first condition results from the realizability of the final cell dependence graph. If a cycle were allowed, we would find cells depending on their own results, waiting to be executed, which is a contradiction. The second and third conditions result from the properties associated with the cycle period. □

Our technique enforces these constraints through the use of an incremental retiming that guarantees the realization of each new retimed graph until all edges become non-zero delay edges.

[Figure 8: two-dimensional plot with axes x and y, the origin (0,0), the dependence vectors d, d', and d'', and the scheduling subspace S.]

Figure 8: An example of scheduling subspace

3.2 The Scheduling Subspace

We know that the existence of a linear schedule vector for the execution of the MDFG is the necessary and sufficient condition for its realizability [13]. We define the space domain for schedule vectors of an MDFG as follows:

Definition 3.2

A scheduling subspace S for a realizable MDFG G = (V, E, d, t) is the space region where there exist schedule vectors that realize G, i.e., if a schedule s ∈ S then s · d(e) ≥ 0 for any e ∈ E.

An example of a scheduling subspace is shown in figure 8 for a two-dimensional case. Using these concepts, we define the basic conditions for a legal multi-dimensional retiming through the following lemma.

Lemma 3.2

Let G = (V, E, d, t) be an MDFG, r a multi-dimensional retiming, and s a schedule vector for the retimed graph G_r = (V, E, d_r, t). Then

(a) for any path $u \stackrel{p}{\leadsto} v$, we have d_r(p) = d(p) + r(u) − r(v)

(b) for any cycle l ∈ G, we have d_r(l) = d(l)

(c) for any edge $u \xrightarrow{e} v$, if d_r(e) ≠ (0,0,...,0) then d_r(e) · s ≥ 0

3.3 The Selection of the Retiming Function

In this segment, we show how to select a legal retiming function for a realizable MDFG. The selection is based on properties of the scheduling subspace associated with the MDFG. We begin by defining a strictly positive scheduling subspace.


Definition 3.3

Given a realizable MDFG G = (V, E, d, t) with a scheduling subspace S, a strictly positive scheduling subspace S+ is the set of all vectors s ∈ S such that d(e) · s > 0 for every d(e) ≠ (0,0,...,0).

It is obvious that if a scheduling subspace S is not empty, then the strictly positive scheduling subspace S+ associated with S is also not empty. Using the strictly positive scheduling subspace concept, we introduce the method of predicting a legal multi-dimensional retiming function in the next theorem.

Theorem 3.3

Let G = (V, E, d, t) be a realizable MDFG, S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, and u ∈ V a node with all the incoming edges having non-zero delays. A legal retiming r(u) is any vector orthogonal to s.

Proof: From the definition of retiming, we have d_r(e1) = d(e1) − r(u) and d_r(e2) = d(e2) + r(u), for some $\xrightarrow{e_1} u \xrightarrow{e_2}$ with d(e1) ≠ (0,0,...,0). To verify that the resulting MDFG is realizable, we compute the inner product of s and each of the retimed dependence vectors. Then, d_r(e1) · s = d(e1) · s − r(u) · s = d(e1) · s > 0, since e1 is an incoming edge and d(e1) ≠ (0,0,...,0). For e2, d_r(e2) · s = d(e2) · s + r(u) · s = d(e2) · s ≥ 0. We now have two cases:

Case 1: if d(e2) ≠ (0,0,...,0) then d(e2) · s > 0. By lemma 3.2 the resulting graph is realizable.

Case 2: if d(e2) = (0,0,...,0) then d_r(e2) = r(u). Since d_r(e1) · s > 0 and, for each edge e, d(e) · s > 0, it is impossible to have a linear combination of these delay vectors orthogonal to s, i.e., parallel to d_r(e2) = r(u). Therefore, no cycle can be found and the resulting graph is realizable. □
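The arithmetic of the proof can be traced on concrete numbers. The sketch below uses node A of the graph in figure 5(a), the schedule vector s = (1,3), and the orthogonal retiming r(A) = (−3,1), all of which appear in the example of section 3.4; the variable names are illustrative.

```python
def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


s = (1, 3)             # schedule vector in S+
r = (-3, 1)            # retiming of node A, orthogonal to s
d_in = (1, 0)          # incoming edge e1: D -> A in figure 5(a)
d_out = (0, 0)         # outgoing edge e2: A -> B in figure 5(a)

dr_in = tuple(a - b for a, b in zip(d_in, r))    # d(e1) - r(u) = (4, -1)
dr_out = tuple(a + b for a, b in zip(d_out, r))  # d(e2) + r(u) = (-3, 1)

orthogonal = dot(r, s)       # 0
in_before = dot(d_in, s)     # 1
in_after = dot(dr_in, s)     # 1: unchanged because r . s = 0
out_after = dot(dr_out, s)   # 0: case 2, the new delay equals r(u)
```

The incoming edge keeps a strictly positive product with s, while the former zero-delay edge now carries the non-zero delay r(u), for which a new strictly positive schedule vector has to be found in the next step.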

Therefore, given a set of dependence vectors, after defining its strictly positive scheduling subspace, we can predict a legal retiming function. These results are used in our incremental retiming algorithm. Here we introduce two important corollaries of theorem 3.3.

Corollary 3.4

Given a realizable MDFG G = (V, E, d, t), S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, and a vector r orthogonal to s, if a set X ⊆ V has all incoming edges non-zero, then r(X) is a legal retiming.

Proof: For some $\xrightarrow{e_1} X \xrightarrow{e_2}$, we know that d(e1) ≠ (0,0,...,0). From theorem 3.3 we conclude that r is a legal retiming in G. □

Corollary 3.5

Given a realizable MDFG G = (V, E, d, t), S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, a vector r orthogonal to s, a set X ⊆ V with all incoming edges non-zero, and an integer value k > 1, then (k r)(X) is a legal retiming.

Proof: Follows immediately from theorem 3.3 and corollary 3.4. □


3.4 The Incremental Multi-Dimensional Retiming Algorithm

The ability to predict a legal retiming for any realizable MDFG allows us to define the incremental multi-dimensional retiming algorithm using the following steps:

1. Given a realizable MDFG G = (V, E, d, t), we use the idea of theorem 3.3 to find a legal retiming: we solve the inequalities s · d(e) > 0 for every e ∈ E with d(e) ≠ (0,0,...,0), where s is the unknown, and then choose a retiming function from the hyperplane that has s as its normal vector.

2. We apply the selected retiming function to any node that has all incoming edges with non-zero delays and at least one outgoing edge with zero delay.

3. Since the resulting MDFG is still realizable, if there are zero-delay edges, go back to step 1.
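The three steps can be sketched in Python for the two-dimensional case. Several choices here are assumptions of the sketch and not part of the algorithm as stated: the schedule vector is found by a bounded brute-force search (`bound` is a hypothetical parameter), the orthogonal vector is reduced to its primitive form with an arbitrary sign convention, and the first eligible node in list order is retimed.

```python
from math import gcd


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def find_strict_schedule(delays, bound=32):
    """Brute-force search for s with s . d > 0 for every non-zero delay d."""
    nonzero = [d for d in delays if any(d)]
    for n in range(1, bound + 1):        # grow the search box gradually
        for sx in range(-n, n + 1):
            for sy in range(-n, n + 1):
                s = (sx, sy)
                if max(abs(sx), abs(sy)) == n and all(dot(s, d) > 0 for d in nonzero):
                    return s
    return None


def orthogonal(s):
    """Primitive integer vector orthogonal to s (sign choice is arbitrary)."""
    g = gcd(s[0], s[1])
    r = (-s[1] // g, s[0] // g)
    return r if (r[1], r[0]) > (0, 0) else (-r[0], -r[1])


def incremental_retiming(nodes, edges):
    edges = dict(edges)
    applied = []                          # (node, retiming vector) per pass
    for _ in range(len(nodes)):           # at most |V| passes
        if all(any(d) for d in edges.values()):
            break                         # fully parallel: no zero-delay edge
        s = find_strict_schedule(list(edges.values()))        # step 1
        r = orthogonal(s)
        u = next(v for v in nodes                              # step 2
                 if all(any(d) for (a, b), d in edges.items() if b == v)
                 and any(not any(d) for (a, b), d in edges.items() if a == v))
        edges = {(a, b): (d[0] + ((a == u) - (b == u)) * r[0],
                          d[1] + ((a == u) - (b == u)) * r[1])
                 for (a, b), d in edges.items()}
        applied.append((u, r))                                 # step 3: repeat
    return edges, applied


V = ["A", "B", "C", "D"]
FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
final_edges, applied = incremental_retiming(V, FIG3)
# applied == [("D", (1, 0)), ("A", (-3, 1))]
```

On the graph of figure 3(a), this particular search order happens to select s = (0,1) and then s = (1,3), so the two passes retime D by (1,0) and A by (−3,1). A different search order or sign convention would give other legal retimings.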

As a consequence of the concepts presented, we state the main theorem as follows:

Theorem 3.6

Let G = (V, E, d, t) be a realizable MDFG. The incremental multi-dimensional retiming algorithm transforms G into G_r in at most |V| iterations, such that G_r is fully parallel.

Proof: After an iteration of the incremental multi-dimensional retiming algorithm, the resulting MDFG is still realizable. Successive iterations of finding s and the respective incremental retiming function allow us to change all zero-delay edges into non-zero ones, obtaining a fully parallel MDFG. After each iteration, all outgoing edges of at least one new node have no zero delay. After at most |V| iterations, full parallelism is achieved. □

Let us examine the example in figure 3. The first incremental retiming is shown in figure 5, where s has the value (0,1). The next retiming must consider the new set of dependencies {(−2,1), (0,1), (1,0)}. We choose s = (1,3); the retiming function for node A can then be either (−3,1) or (3,−1). Choosing r(A) = (−3,1), we obtain the graph in figure 9(a). The equivalent Fortran code is not a simple representation of the graph because it requires a recursion schedule vector [13], i.e., a schedule vector parallel to one of the axes in the index space. We use loop skewing [24, 25] to adjust the schedule vector. The code equivalent to the final retimed graph is shown in figure 9(b).

4 Chained Multi-Dimensional Retiming

The previous section described a method to find a fully parallel solution in several iterations of an incremental retiming operation. To increase the efficiency of our algorithm, we now introduce the concepts that allow us to obtain the fully parallel solution in a single pass.

Definition 4.1

A chain is a path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} v_2 \cdots \xrightarrow{e_{k-1}} v_k$, where all incoming edges for node v_0 are non-zero delay edges and there exists at least one zero-delay edge preceding the nodes v_i, 1 ≤ i ≤ k. The nodes are numbered in a monotonically increasing order, and k is said to be the length of the chain.

[Figure 9(a): the MDFG with edge delays (−2,1), (0,1), (4,−1), (−3,1), and (−3,1).]

          DO 10 jp = 3, 4*n+m-1
    C     --- prologue ---
          DO 11 kp = max(3,jp-n), min(3*n+m-1,jp-3)
            d(4*kp-3*jp+1,jp-kp) = b(4*kp-3*jp+2,jp-kp-1) * c(4*kp-3*jp,jp-kp-1)
            a(4*kp-3*jp-3,jp-kp+1) = d(4*kp-3*jp-3,jp-kp+1) * .5
            b(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 1.
            c(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 2.
       11 CONTINUE
    C     --- epilogue ---
       10 CONTINUE

Figure 9: (a) MDFG fully parallel (b) equivalent Fortran code after loop skewing

From the definition above we derive a method for identifying all chains belonging to an MDFG. We call this construction a multi-chain structure, and it is built according to the following steps:

1. All non-zero delay edges are removed from the MDFG. Since there is no zero-delay cycle in a realizable MDFG, the result is a directed acyclic graph (DAG).

2. A modified topological sort algorithm [22] is used to order the nodes in levels from the end to the beginning (see footnote 2). Each node is labeled according to its level number, which produces the monotonically increasing characteristic of the node indices in a chain.

We call the highest level number obtained in the construction process described above the multi-chain maximum length. An example of a chain can be obtained from figure 3(a). The existing chains are D, A, B and D, A, C. Node D is at level 0, node A at level 1, and nodes B and C at level 2. Those chains have a multi-chain maximum length of 2. To speed up the incremental retiming process, we introduce the idea of a chained retiming according to the new algorithm below:
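One way to read the level construction is sketched below in Python. It computes, for each node, the length of the longest zero-delay path that starts at the node and numbers the levels from the end backwards. This is an interpretation that reproduces the levels quoted for figure 3(a); it is not necessarily the modified topological sort of [22], and the function name `chain_levels` is hypothetical.

```python
def chain_levels(nodes, edges):
    """Return (level of each node, multi-chain maximum length k)."""
    succ = {v: [] for v in nodes}
    for (u, v), d in edges.items():
        if not any(d):                   # step 1: keep only zero-delay edges
            succ[u].append(v)

    height = {}                          # longest zero-delay path from a node

    def visit(v):
        if v not in height:
            height[v] = 1 + max((visit(w) for w in succ[v]), default=-1)
        return height[v]

    k = max(visit(v) for v in nodes)     # multi-chain maximum length
    return {v: k - height[v] for v in nodes}, k   # step 2: level labels


V = ["A", "B", "C", "D"]
FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
levels, k = chain_levels(V, FIG3)
# levels == {"A": 1, "B": 2, "C": 2, "D": 0}, k == 2
```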

1. Find a legal retiming function r as in the first step of the incremental retiming algorithm.

2. Construct the multi-chain structure, labeling the nodes accordingly and computing the multi-chain maximum length k.

3. Retime each node v_i by (k − i) r. The result is the fully parallel MDFG.
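A compact Python sketch of these steps follows. It takes the retiming vector r of step 1 as an input, and it assumes that the level of a node is k minus the length of the longest zero-delay path leaving it, so that the factor (k − i) equals that path length. The function name `chained_retiming` is hypothetical.

```python
def chained_retiming(nodes, edges, r):
    """Retime each node at level i by (k - i) * r in a single pass."""
    succ = {v: [] for v in nodes}
    for (u, v), d in edges.items():
        if not any(d):                    # zero-delay edges form a DAG
            succ[u].append(v)

    height = {}                           # longest zero-delay path from a node

    def visit(v):
        if v not in height:
            height[v] = 1 + max((visit(w) for w in succ[v]), default=-1)
        return height[v]

    for v in nodes:
        visit(v)

    # Level i = k - height, so the multiplier (k - i) is the height itself.
    shift = {v: tuple(height[v] * c for c in r) for v in nodes}
    retimed = {(u, v): tuple(x + a - b for x, a, b in zip(d, shift[u], shift[v]))
               for (u, v), d in edges.items()}
    return retimed, shift


V = ["A", "B", "C", "D"]
FIG3 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
        ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
retimed, shift = chained_retiming(V, FIG3, (1, 0))
# shift["D"] == (2, 0), shift["A"] == (1, 0); every edge now has a delay.
```

For the graph of figure 3(a) with r = (1,0), node D is retimed by (2,0) and node A by (1,0), and no zero-delay edge remains.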

Footnote 2: Every node is assigned a unique level number, even if it belongs to multiple chains.

[Figure 10(a): the MDFG with edge delays (−3,1), (−1,1), (1,0), (1,0), and (1,0).]

          DO 10 j = 0, n
    C     prologue
          d(0,j) = b(1,j-1) * c(-1,j-1)
          d(1,j) = b(2,j-1) * c(0,j-1)
          a(0,j) = d(0,j) * .5
          DO 11 k = 0, m-2
            d(k+2,j) = b(k+3,j-1) * c(k+1,j-1)
            a(k+1,j) = d(k+1,j) * .5
            b(k,j) = a(k,j) + 1.
            c(k,j) = a(k,j) + 2.
       11 CONTINUE
    C     epilogue
          a(m,j) = d(m,j) * .5
          b(m-1,j) = a(m-1,j) + 1.
          c(m-1,j) = a(m-1,j) + 2.
          b(m,j) = a(m,j) + 1.
          c(m,j) = a(m,j) + 2.
       10 CONTINUE

Figure 10: (a) MDFG fully parallel using chained retiming (b) equivalent Fortran code

For this algorithm, only one step of incremental retiming is executed at the beginning. The rest of the algorithm requires only O(|E|) time. Therefore, this algorithm is more efficient than the incremental retiming. The next theorem proves that the MDFG obtained from the chained retiming algorithm is realizable.

Theorem 4.1

Let G = (V, E, d, t) be a realizable MDFG. The chained multi-dimensional retiming algorithm transforms G into G_r, such that G_r is realizable and fully parallel.

Proof: By using the chained multi-dimensional algorithm, we find r, an incremental retiming for G, k, the multi-chain maximum length of G, and i, the level number assigned to each node v ∈ V. The retiming of each node v_i by (k − i) r results in final dependence vectors of the form d(e) − j r, d(e) + j r, or j r, where 1 ≤ j ≤ k − 1. By corollary 3.5, the retimed MDFG is realizable, and since there are no zero-delay edges, G_r is fully parallel. □

Revisiting the example in figure 3, we find that for k = 2 (the length of the existing chains) and r = (1,0) (the first incremental retiming function) we could have applied only one retiming step, in such a way that node D would be retimed by 2(1,0), i.e., (2,0), and node A would be retimed by 1(1,0). The final graph and Fortran code are shown in figure 10.

5 Experiments

In this section we present the application of our method to two different examples. Our first example is the IIR filter introduced in section 1. The second example is an MDFG representing a wave digital filter. This MDFG computes the solution of a partial differential equations problem (a transmission line problem), based on the Fettweis method [9].

[Figure 11: two copies of the IIR filter MDFG with nodes M1 to M8 and A1 to A7; the retiming vectors (−7,4) in (a) and (−15,8) in (b) appear among the edge delays.]

Figure 11: (a) MDFG after retiming A6 (b) fully parallel MDFG after retiming A5

IIR Filter

We now revisit the example shown in figure 1. Assume both adders and multipliers take one time unit to compute. Using the incremental retiming concept, we begin by retiming all nodes labeled M by r = (−1,1) for a schedule vector s = (1,1). Figure 2(a) shows the resulting MDFG. The new set of dependencies D is now given by:

$$ D = \begin{bmatrix} 3 & 3 & 3 & 2 & 2 & 2 & 1 & 1 & -1 \\ 1 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & 1 \end{bmatrix} $$

After retiming nodes A1 to A4 by r = (−3,2), the graph presented in figure 2(b) is produced. In the next step, we retime node A6 by r = (−7,4) and obtain the graph shown in figure 11(a). The final graph, presented in figure 11(b), is obtained after retiming the node A5 by r = (−15,8).
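The first two retiming vectors of this example can be reproduced by searching for a strictly positive schedule vector and taking an orthogonal vector. The Python sketch below uses a bounded brute-force search and a sign convention that are assumptions of the sketch; the schedule vector (2,3) it finds for D is not stated in the text, but the orthogonal vector (−3,2) matches the retiming used above.

```python
from math import gcd


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def find_strict_schedule(delays, bound=32):
    """Brute-force search for s with s . d > 0 for every non-zero delay d."""
    nonzero = [d for d in delays if any(d)]
    for n in range(1, bound + 1):
        for sx in range(-n, n + 1):
            for sy in range(-n, n + 1):
                s = (sx, sy)
                if max(abs(sx), abs(sy)) == n and all(dot(s, d) > 0 for d in nonzero):
                    return s
    return None


def orthogonal(s):
    """Primitive integer vector orthogonal to s (sign choice is arbitrary)."""
    g = gcd(s[0], s[1])
    r = (-s[1] // g, s[0] // g)
    return r if (r[1], r[0]) > (0, 0) else (-r[0], -r[1])


# Delays of the original IIR filter MDFG (figure 1).
original = [(0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
# Columns of the matrix D after retiming the M nodes.
retimed = [(3, 1), (3, 0), (3, -1), (2, 1), (2, 0), (2, -1),
           (1, 1), (1, 0), (-1, 1)]

s1 = find_strict_schedule(original)   # (1, 1)
r1 = orthogonal(s1)                   # (-1, 1)
s2 = find_strict_schedule(retimed)    # (2, 3)
r2 = orthogonal(s2)                   # (-3, 2)
```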

Our second approach is to use the chained retiming algorithm. We find the topological ordering of the DAG associated with the graph in figure 1. Nodes labeled M5 to M8 are assigned to level 0, nodes M3, M4, A3, and A4 to level 1, M1, M2, A2, and A6 to level 2, A1 and A5 to level 3, and finally, A7 to level 4. Therefore the multi-chain maximum length of G is 4. We know that the first incremental retiming function for this set of dependencies is (−1,1). Then we may apply the following retiming functions to the graph: r(Mi) = (−4,4) for i = 5 to 8, r(M3) = r(M4) = r(A3) = r(A4) = (−3,3), r(M1) = r(M2) = r(A2) = r(A6) = (−2,2), and r(A1) = r(A5) = (−1,1). The final fully parallel graph is shown in figure 12.

Figure 12: Fully parallel MDFG obtained through chained retiming

Transmission Line Simulation

In this example, we introduce a two-dimensional transmission line problem initially transformed by using the Fettweis method (see [9] for a tutorial). The transmission line is characterized by the equations below:

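$$ l \frac{\partial i_1}{\partial t_2} + r\, i_1 + \frac{\partial u}{\partial t_1} = f_1 $$

$$ \frac{\partial i_1}{\partial t_1} + c \frac{\partial u}{\partial t_2} + g\, u = f_2 $$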

After applying the Fettweis transformations, we obtain the Wave Digital Filter shown in figure 13(a), which is equivalent to the MDFG in figure 13(b), where the inputs e1 and e2 are zero. The two-port adaptors, A and B, are expanded to their internal configuration. For simplicity, one output port and one adder in each adaptor were deleted because of the particular boundary conditions applied. Using the incremental algorithm, we would obtain the graph shown in figure 14(a) after incrementally applying the following retiming functions: r(F) = r(E) = (1,0), r(H) = r(G) = (−3,1), r(B1) = r(A1) = (−7,2), r(B2) = r(A2) = (−15,4), and r(B3) = r(A3) = (−31,8). Figure 14(b) shows the result of applying the chained retiming method to this case. The retiming functions were r(F) = r(E) = (5,0), r(H) = r(G) = (4,0), r(B1) = r(A1) = (3,0), r(B2) = r(A2) = (2,0), and r(B3) = r(A3) = (1,0).

6 Conclusion

We have presented two novel techniques for optimizing a multi-dimensional data flow graph through the use of a multi-dimensional retiming method that we developed. We call these

[Figure 13(a): Wave Digital Filter with inputs e1 and e2, two-port adaptors A and B, and delay elements T1,T2. Figure 13(b): MDFG with multiplier nodes C to H, adder nodes A1 to A3 and B1 to B3, and edge delays (1,1) and (−1,1).]

Figure 13: (a) Wave Digital Filter graph (b) equivalent MDFG

Figure 14: Final MDFG for the transmission line problem, (a) using incremental retiming (b) using chained retiming

References
