In the context of multidimensional software pipelining, vector lifetimes with complex shapes present unusual challenges to register allocation. This article presents a systematic solution to these challenges. The essential problems of lifetime normalization and representation were addressed to precisely abstract the lifetimes. Conservative and aggressive distances were proposed to guide bin packing and circumference minimization, free of conflict, with multiple fit strategies. The method subsumes the classical register allocation for software-pipelined single loops as a special case. Experiments indicate that the first-fit strategy, aided by the aggressive distance, effectively minimizes register usage at insignificant compile-time cost.

The software pipelining technique and its register allocation have their strengths and limitations. The case study in Section 9 has suggested several limiting factors inherent in the loops. Some of them could be overcome by combining SSP with hierarchical reduction. We also need to improve the register allocation approach with spilling and splitting. The performance of SSP then requires extensive evaluation. Combining SSP with high-level loop transformations like tiling, and with hardware pipeline designs like FPGA-based reconfigurable computing, so that all levels of parallelism are exploited, is another important future direction. Finally, we are currently extending SSP to many-core architectures to exploit both thread-level and instruction-level parallelism. With the paradigm shift from uniprocessors to many-core architectures, we envision the combination of software pipelining and multithreading as an inevitable trend.

APPENDIXES

APPENDIX A: FORMAL DEFINITIONS FOR LIFETIME REPRESENTATION
From Appendix A.1 to A.5, we introduce the lifetime representation for the loop and kernel models in Figures 5(a) and 5(b), where l1 = l2 = · · · = ln. Then in Appendix A.6, we extend the lifetime representation to the more general loop nest and kernel model in Figures 20(a) and 20(b), where l1, l2, . . ., and ln are not necessarily equal. In both cases, the kernel has a single II of T cycles. We do not need to study the multi-II case, because it can be treated as the single-II case, as discussed in Section 6.4.

For a variable, the core and derived parameters are defined from the reference set REFS. For convenience, we refer to the elements of a reference ref as time(ref), type(ref) (USE or DEF), omega(ref), and level(ref), respectively. Also let latency(ref) be the latency of the operation that contains the reference (as an operand).

A.1 SingleStart, SingleEnd, Omega, and Alpha

singleStart = min time(r), ∀r ∈ REFS s.t. type(r) = DEF

singleEnd = max (time(r) + latency(r) + omega(r) ∗ T), ∀r ∈ REFS

omega = max omega(r), ∀r ∈ REFS

singleStart and singleEnd are equivalent to the start and end times used in traditional register allocation for single loops [Rau et al. 1992], described in Section 2.2. In defining singleEnd, not only uses but also definitions must be considered. In case a variable is defined but not used, the end time of that variable should be the completion time of the last definition.

In a healthy program, only a reference at the outermost loop level, which is software pipelined, can have a nonzero omega. The loop nest in Figure 3(a) is a typical example. Also, the omega of a definition is always 0.

Alpha depends on the code outside the loop nest and cannot be computed from the references. We assume it has already been given by an earlier phase of the compiler.
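The three A.1 formulas translate directly into code. Below is a minimal sketch; the Ref record and its field names are our own illustrative encoding of time(ref), type(ref), omega(ref), and latency(ref), not a structure defined in the article.

```python
from dataclasses import dataclass

# Hypothetical reference record; fields mirror time(ref), type(ref),
# omega(ref), and latency(ref) from the text.
@dataclass
class Ref:
    time: int
    type: str       # "USE" or "DEF"
    omega: int
    latency: int

def core_params(refs, T):
    """Compute singleStart, singleEnd, and omega for one variable
    from its reference set (a sketch of the Appendix A.1 formulas)."""
    single_start = min(r.time for r in refs if r.type == "DEF")
    # Definitions count too: a variable that is defined but never used
    # ends when its last definition completes.
    single_end = max(r.time + r.latency + r.omega * T for r in refs)
    omega = max(r.omega for r in refs)
    return single_start, single_end, omega
```

For example, with T = 4, a DEF at cycle 0 (latency 2) and a USE at cycle 5 (latency 1, omega 1) yield singleStart = 0, singleEnd = 5 + 1 + 1 ∗ 4 = 10, and omega = 1.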

A.2 Start and End

Given a loop level i ∈ [1, n], if level i has no definition in REFS, we set (start[i], end[i]) = (+∞, −∞). Otherwise, let the definition be d ∈ REFS; then

start[i] = time(d)

end[i] = max(time(d) + latency(d), time(u) + latency(u) + loopBackOffset(d, u), ∀u ∈ USES(d)),

where USES(d) is the set of all possible uses of d, that is, all the uses that can be reached by d in the control flow graph, and loopBackOffset(d, u) is an extra offset when d reaches a use u along a back edge of a loop.

In terms of dataflow analysis, a point x in the control flow graph reaches another point y if there is a path from x to y without any definition of the variable under consideration between them. The control flow graph in this article is the graph for the original loop nest (before software pipelining) defined in Figure 5(a), which contains n back edges, one for each loop.

loopBackOffset(x, y) is the offset if point x reaches point y along a back edge of a loop, either the outermost loop or any of the inner loops. It is defined as

loopBackOffset(x, y) =
    omega(y) ∗ T    if x reaches y via the back edge of the outermost loop,
    max(Sj ∗ T)     if x reaches y via the back edge of any inner loop Lj (2 ≤ j ≤ n),
    0               otherwise.

For the second case, where x reaches y via the back edge of an inner loop Lj: since the inner loop runs sequentially, the offset due to this loop is Sj ∗ T. For example, in Figure 11(d), operation c defines TNx; this definition reaches two uses, one a source operand of operation b, and the other a source operand of operation c. Take the latter use, for instance. The definition can reach it along the back edge of either loop L2 (via path 2) or loop L3 (via path 3), and therefore the offset equals max(S2 ∗ T, S3 ∗ T) = S2 ∗ T.
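The case analysis above can be sketched as a small function. Encoding a path by the set of loop levels whose back edges it may follow is our own illustrative assumption, not part of the article's formalism:

```python
def loop_back_offset(back_edge_levels, omega_y, S, T):
    """loopBackOffset(x, y): extra cycles when x reaches y along a back
    edge. back_edge_levels holds the levels of the candidate back edges
    (1 = outermost); S[j] is the stage count of inner loop Lj; T is the II."""
    if not back_edge_levels:
        return 0                # x reaches y without taking any back edge
    if 1 in back_edge_levels:
        return omega_y * T      # outermost loop: omega(y) * T
    # Inner loops run sequentially, so loop Lj contributes S[j] * T cycles;
    # the offset is the maximum over the candidate back edges.
    return max(S[j] * T for j in back_edge_levels)
```

With T = 4 and stage counts S2 = 2 and S3 = 1, the Figure 11(d) example gives max(S2 ∗ T, S3 ∗ T) = 8 = S2 ∗ T.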

A.3 NextStart

Given a loop level i ∈ [1, n], let x be a point at this level. If level i has a definition d ∈ REFS, let x = d. Otherwise, let x be the starting point of loop Li, that is, x is at the first cycle of the first stage of Li. Then

nextStart[i] =
    min{ time(d′) + loopBackOffset(x, d′), ∀d′ ∈ NEXTDEFS(x) }    if NEXTDEFS(x) is not empty,
    +∞    otherwise,

where NEXTDEFS(x) is the set of all possible next definitions of the variable in the same iteration as point x. The back edge of the outermost loop should not be followed in calculating this set; otherwise, the set might contain a definition from the next iteration. Formally,

NEXTDEFS(x) = { y | y ∈ REFS and type(y) = DEF and x can reach y without following the back edge of the outermost loop }.
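The +∞ convention for an empty NEXTDEFS(x) translates directly. A minimal sketch, which assumes the caller has already formed the candidate values time(d′) + loopBackOffset(x, d′):

```python
import math

def next_start(candidate_times):
    """nextStart[i]: the earliest time(d') + loopBackOffset(x, d') over
    NEXTDEFS(x), or +infinity when NEXTDEFS(x) is empty."""
    return min(candidate_times, default=math.inf)
```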

A.4 FirstStretch and LastStretch

Take the first ILES as a point of reference. Let x ≥ 0 be the iteration index of a stretched interval at loop level i. The interval is stretched if it is defined before the first cycle of the first ILES but its last use reaches or goes beyond that cycle, and it is not within the first group of iterations. For example, in Figure 4, the interval of TN2 defined by operation b3 spans across the first ILES, and the use c3,0 is in the second group. Formally, the conditions are:

start[i] + x ∗ T < ln ∗ T,

end[i] + x ∗ T > ln ∗ T,

x + omega ≥ Sn    when i = 1,

x ≥ Sn    when i > 1,

and x ≥ 0.

Let first and last be the smallest and largest solutions of x to the prior inequalities. That is,

first = min over ∀i ∈ [1, n] of max( 0, Sn − span, ln − ⌊(end[i] − 1) / T⌋ )

last = max over ∀i ∈ [1, n] of ( ln − ⌈(start[i] + 1) / T⌉ ),

where span = omega if i = 1, or span = 0 otherwise.

Note that when n = 1 or first > last, there is no stretched interval at all. Thus we define

(firstStretch, lastStretch) =
    (+∞, −∞)       if n = 1 or first > last,
    (first, last)  otherwise.
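The inequalities and their closed-form solutions can be sketched as follows. The list encoding of start[i]/end[i] (with (+∞, −∞) for levels without a definition, which are skipped) is our own illustrative assumption:

```python
import math

def stretch_bounds(start, end, S_n, l_n, T, omega):
    """firstStretch/lastStretch from the Appendix A.4 inequalities.
    start[i-1]/end[i-1] hold start[i]/end[i] for level i; levels without
    a definition carry (+inf, -inf) and contribute nothing."""
    n = len(start)
    first, last = math.inf, -math.inf
    for i in range(1, n + 1):
        s, e = start[i - 1], end[i - 1]
        if s == math.inf:          # level i has no definition
            continue
        span = omega if i == 1 else 0
        # Smallest x with end[i] + x*T > l_n*T, clamped by x >= S_n - span
        # and x >= 0; then the minimum over all levels.
        first = min(first, max(0, S_n - span, l_n - (e - 1) // T))
        # Largest x with start[i] + x*T < l_n*T; maximum over all levels.
        last = max(last, l_n - math.ceil((s + 1) / T))
    if n == 1 or first > last:
        return (math.inf, -math.inf)
    return (first, last)
```

For instance, with n = 2, T = 4, ln = 3, Sn = 2, omega = 0, start = [0, 2], and end = [14, 6], both levels yield x = 2, so (firstStretch, lastStretch) = (2, 2).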

A.5 Top and Bottom

Again take the first ILES as a point of reference. If outermostIntervalOnly is true, the segment is made of only stretched intervals, if any. Otherwise, there are Sn scalar lifetimes from the first group of iterations, followed by the stretched intervals, if any. For example, in Figure 6, TN1 has only a stretched interval in the first ILES, while TN2 has Sn = 3 scalar lifetimes and two stretched intervals. Therefore,

(top, bottom) =
    (firstStretch, lastStretch + 1)    if outermostIntervalOnly,
    (0, max(Sn, lastStretch + 1))      otherwise.

Note that if outermostIntervalOnly is true, and no interval is stretched (i.e., (firstStretch, lastStretch) = (+∞, −∞)), the ILES would have no interval in it. In this case, according to the formulas, (top, bottom) would be (+∞, −∞), as expected.
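A sketch of the top/bottom formula, using floating-point infinities; note that −∞ + 1 stays −∞, which reproduces the boundary case just described:

```python
import math

def top_bottom(first_stretch, last_stretch, S_n, outermost_interval_only):
    """(top, bottom) of the first ILES (Appendix A.5)."""
    if outermost_interval_only:
        # Only the stretched intervals, if any, make up the segment.
        return (first_stretch, last_stretch + 1)
    # S_n scalar lifetimes from the first group, then stretched intervals.
    return (0, max(S_n, last_stretch + 1))
```

With no stretched interval, (+∞, −∞) maps to (+∞, −∞); with firstStretch = 1 and lastStretch = 2 and outermostIntervalOnly true, the result is (1, 3).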

A.6 Extensions

This section describes in more detail the parameters under the more general loop nest and kernel model in Figures 20(a) and 20(b).

A.6.1 Core Parameters. Since a loop level is allowed to have two intervals, start, end, and nextStart are extended to be two-dimensional arrays. The interval defined at PosA (PosB) at loop level i, and the hole following it, is described by start[PosA][i], end[PosA][i], and nextStart[PosA][i] (start[PosB][i], end[PosB][i], and nextStart[PosB][i]). The meaning of these parameters is exactly the same as that of start[i], end[i], and nextStart[i] introduced in Appendixes A.2 and A.3, except that the interval starts at the specific position. When i = n, we set start[PosA][i], end[PosA][i], and nextStart[PosA][i] to be the same as start[PosB][i], end[PosB][i], and nextStart[PosB][i], respectively.

One difficulty is the issue time of the definition for computing start[PosB][i]. If i < n, this definition comes after the loops Ln, Ln−1, . . ., and Li+1, and thus its issue time depends on the total execution time of these loops, which may be unknown at compile time. Similar problems exist for end[PosB][i] and nextStart[PosB][i].

Fortunately, it is not necessary to calculate the absolute issue time; a relative time is sufficient for our method. Note that the purpose of representing an interval IA of a vector lifetime A is to compute INTL[A, B], where IA is interleaved with the corresponding interval IB of a vector lifetime B. Look at formulas (10) and (11). What we need are only some relative differences, for example, end(IA) − nextStart(IB). Since both IA and IB are from the same position of the same loop level, it is sufficient to keep the differences if every issue time is expressed relative to the same reference point.

For this reason, we choose to imagine that each of the loops before the interval, Ln, Ln−1, . . ., and Li+1, has one iteration (i.e., Nn = Nn−1 = · · · = Ni+1 = 1). Therefore, the issue time of the definition of the interval is just its 1-D schedule time. Similarly, the issue time of any use of the definition of the interval, or of any definition following the interval, is also the 1-D schedule time, except that there may be loop-back offsets, as described before.

A.6.2 Derived Parameters. Derived parameters have two more elements to describe the extended intervals: firstExtend and lastExtend. There are also minor changes in other parameters, described in the following.

Stretching happens during the filling of an iteration, and thus only for the intervals defined at PosA. For the formulas in Appendix A.4, we need to change start[i] and end[i] to start[PosA][i] and end[PosA][i], respectively.

In contrast, extended intervals appear during the draining of an iteration, and thus only for the intervals defined at PosB. Now we show how to calculate firstExtend and lastExtend. As before, we take the first ILES as the reference. Imagine that before the first iteration (iteration 0), there can be other iterations with negative indexes. Let x < 0 be the iteration index of an extended interval at loop level i. The interval is extended if its definition is before the first ILES, but its last use reaches to or beyond the first cycle of that segment. More formally,

start[PosB][i] + x ∗ T < ln ∗ T,

end[PosB][i] + x ∗ T > ln ∗ T.

As the direct solutions to these inequalities, let

firstEx = min over ∀i ∈ [1, n] of ( ln − ⌊(end[PosB][i] − 1) / T⌋ )

lastEx = max over ∀i ∈ [1, n] of min( −1, ln − ⌈(start[PosB][i] + 1) / T⌉ ).

Note that when n = 1 or firstEx > lastEx, there is no extended interval at all. Therefore we define

(firstExtend, lastExtend) =
    (+∞, −∞)           if n = 1 or firstEx > lastEx,
    (firstEx, lastEx)  otherwise.

Next, the top and bottom of the first ILES may change in the new situation. Let oldTop and oldBottom be the top and bottom defined in Appendix A.5, which consider the stretched intervals only. For the current situation, we need to further consider the extended intervals as well:

top = min(oldTop, firstExtend)

bottom = max(oldBottom, lastExtend + 1).
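The extended-interval bounds and the merged top/bottom can be sketched analogously to the stretched case. As before, the list encoding of start[PosB][i]/end[PosB][i] (with (+∞, −∞) for levels without a definition) is our own illustrative assumption:

```python
import math

def extend_bounds(start_B, end_B, l_n, T):
    """firstExtend/lastExtend for intervals defined at PosB (A.6.2).
    start_B[i-1]/end_B[i-1] hold start[PosB][i]/end[PosB][i]; levels
    without a definition carry (+inf, -inf) and contribute nothing."""
    n = len(start_B)
    first_ex, last_ex = math.inf, -math.inf
    for s, e in zip(start_B, end_B):
        if s == math.inf:
            continue
        # Smallest x with end + x*T > l_n*T ...
        first_ex = min(first_ex, l_n - (e - 1) // T)
        # ... and largest x with start + x*T < l_n*T, clamped to x <= -1.
        last_ex = max(last_ex, min(-1, l_n - math.ceil((s + 1) / T)))
    if n == 1 or first_ex > last_ex:
        return (math.inf, -math.inf)
    return (first_ex, last_ex)

def merged_top_bottom(old_top, old_bottom, first_extend, last_extend):
    """Fold the extended intervals into the Appendix A.5 top/bottom."""
    return (min(old_top, first_extend), max(old_bottom, last_extend + 1))
```

For example, with n = 2, T = 4, ln = 2, start[PosB] = [9, 13], and end[PosB] = [15, 20], the bounds come out as (firstExtend, lastExtend) = (−2, −1); merging them into an old (top, bottom) of (0, 3) gives (−2, 3).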
APPENDIX B: INITIALIZATION OF LOOP CONTROL REGISTERS
IN CODE GENERATION

The control registers, LC and EC, must be initialized correctly. A wrong setting can make the loop L1 run infinitely, because the operation br.ctop L1 before the epilog might never have the right condition to fall through to the epilog.

LC and EC are first initialized at the beginning of the schedule. We assert that LC and EC must have already been 0 before the epilog. In order to drain the remaining operation instances, we reinitialize EC such that after the epilog is finished, both LC and EC are 0 again. We show how to find the correct initial values of the two registers in both cases.

First, we discuss how to initialize LC and EC at the beginning of the schedule. Take Figure 33 as an example. Note the operation br.ctop L1 before the epilog. It is executed after an ILES. If this is the last ILES, then right before this branch operation, LC should be 0 and EC should be 1. In this condition, after the branch, both LC and EC become 0, and the branch falls through to the epilog. We assert the conditions before and after this branch for clarity.

From the conditions, we can derive the initial setting of the loop control registers. There are N1 iterations in total. Therefore, LC is initialized to

LC = N1 − 1.

Assume EC has an initial value of x. We consider how the (LC, EC) pair changes. One br.ctop changes it to (LC − 1, EC) when LC is not 0, or to (LC, EC − 1) otherwise. Therefore, the pair changes in the whole process from (N1 − 1, x) to (0, x) after N1 − 1 br.ctop operations, and then from (0, x) to (0, 0) after x more br.ctop operations. Therefore, the total number of br.ctop operations that have been executed is

N1 − 1 + x.

On the other hand, we have ln − Sn br.ctop operations in the prolog, and ⌈N1/Sn⌉ OLPs, with each OLP taking Sn br.ctop operations, including the one after the corresponding ILES. Therefore, the total number of br.ctop operations that have been executed before the epilog can also be stated as

(ln − Sn) + ⌈N1/Sn⌉ ∗ Sn.

From the equality, we can easily find that EC should be initialized to

x = ln − ((N1 − 1) mod Sn).

This formula generalizes the one in our previous publication [Rong et al. 2004], where the assumption was l1 = l2 = · · · = ln, while here they are not necessarily equal.

Now let us consider the initialization for the epilog. Before the initialization, both LC and EC are 0. Since there is no new iteration, LC should continue to be 0, while EC should enable the remaining stages to drain. In general, there are l1 − fn + 1 partial kernels in the epilog. After each of the partial kernels, we need a br.ctop operation to rotate the registers. Therefore, EC should be initialized to

l1 − fn + 1.
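The two initializations can be sketched as follows; the internal assertion checks the br.ctop-count equality derived above, so a violation would indicate an inconsistent (N1, ln, Sn) configuration rather than a property of the formula:

```python
import math

def init_lc_ec(N1, l_n, S_n):
    """Initial LC/EC at the beginning of the schedule (Appendix B sketch).
    N1: outermost trip count; l_n and S_n as in the kernel model."""
    LC = N1 - 1
    EC = l_n - ((N1 - 1) % S_n)
    # Sanity check: the total br.ctop count N1 - 1 + EC must equal the
    # prolog/OLP count (l_n - S_n) + ceil(N1 / S_n) * S_n.
    assert N1 - 1 + EC == (l_n - S_n) + math.ceil(N1 / S_n) * S_n
    return LC, EC

def epilog_ec(l1, f_n):
    """EC reinitialized before the epilog: one register rotation per
    partial kernel, of which there are l1 - f_n + 1."""
    return l1 - f_n + 1
```

For instance, N1 = 7, ln = 5, Sn = 3 gives (LC, EC) = (6, 5): the prolog issues 5 − 3 = 2 br.ctop operations and the ⌈7/3⌉ = 3 OLPs issue 9 more, matching 6 + 5 = 11 in total.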

ACKNOWLEDGMENTS

We are grateful to the anonymous reviewers, and to Chan Sun, Ross Towle, Shuxin Yang, Jose Nelson Amaral, R. Govindarajan, Jean C. Beyler, and Michel Strasser for their valuable comments.

REFERENCES

AIKEN, A., NICOLAU, A., AND NOVACK, S. 1995. Resource-constrained software pipelining. IEEE Trans. Parall. Distrib. Syst. 6, 12, 1248–1270.

ALLAN, V. H., JONES, R. B., LEE, R. M., AND ALLAN, S. J. 1995. Software pipelining. ACM Comput. Surv. 27, 3, 367–432.

ALLEN, J. R., KENNEDY, K., PORTERFIELD, C., AND WARREN, J. 1983. Conversion of control dependence to data dependence. In Proceedings of the 10th Annual ACM Symposium on Principles of Programming Languages. 177–189.

AUSLANDER, M. AND HOPKINS, M. 2004. An overview of the PL.8 compiler. SIGPLAN Notices 39, 4, 38–48.

CALLAHAN, D. AND KOBLENZ, B. 1991. Register allocation via hierarchical graph coloring. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI'91). ACM Press, 192–203.

CARR, S., DING, C., AND SWEANY, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS'96), Software Technology and Architecture, vol. 1. IEEE Computer Society, 183.

CHAITIN, G. 2004. Register allocation and spilling via graph coloring. SIGPLAN Notices 39, 4, 66–74.

CHENG, W.-K. AND LIN, Y.-L. 1999. Code generation of nested loops for DSP processors with heterogeneous registers and structural pipelining. ACM Trans. Des. Autom. Electron. Syst. 4, 3, 231–256.

CYTRON, R., FERRANTE, J., ROSEN, B. K., WEGMAN, M. N., AND ZADECK, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451–490.

DARTE, A. AND ROBERT, Y. 1994. Constructive methods for scheduling uniform loop nests. IEEE Trans. Parall. Distrib. Syst. 5, 8, 814–822.

DEHNERT, J. C. AND TOWLE, R. A. 1993. Compiling for the Cydra 5. J. Supercomput. 7, 1-2, 181–227.

DOUILLET, A. AND GAO, G. R. 2005. Register pressure in software-pipelined loop nests: Fast computation and impact on architecture design. In The 18th International Workshop on Languages and Compilers for Parallel Computing (LCPC'05). Hawthorne, NY, 17–31.

DOUILLET, A. AND GAO, G. R. 2007. Software-pipelining on multi-core architectures. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007). IEEE Computer Society, 39–48.

EBCIOGLU, K. AND NAKATANI, T. 1990. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing. Pitman Publishing, London, UK, 213–229.

GAO, G. R., NING, Q., AND DONGEN, V. V. 1993. Software pipelining for nested loops. ACAPS Tech. Memo 53, School of Computer Science, McGill Univ., Montréal, Québec.

HENDREN, L. J., GAO, G. R., ALTMAN, E. R., AND MUKERJI, C. 1992. A register allocation framework based on hierarchical cyclic interval graphs. In Proceedings of the 4th International Conference on Compiler Construction (CC'92). Springer-Verlag, 176–191.

HUFF, R. A. 1993. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI'93). Albuquerque, 258–267.

INTEL. 2001. Intel IA-64 Architecture Software Developer's Manual. Vol. 1: IA-64 Application Architecture. Intel Corporation, Santa Clara, CA.

LAM, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI'88). 318–328.

LAMPORT, L. 1974. The parallel execution of DO loops. Comm. ACM 17, 2, 83–93.

LAWLER, E. L., LENSTRA, J. K., RINNOOY KAN, A. H. G., AND SHMOYS, D. B. 1985. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley & Sons.

MOON, S.-M. AND EBCIOĞLU, K. 1997. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Trans. Program. Lang. Syst. 19, 6, 853–898.

MUTHUKUMAR, K. AND DOSHI, G. 2001. Software pipelining of nested loops. Lecture Notes in Computer Science, vol. 2027, 165–181.

RAMANUJAM, J. 1994. Optimal software pipelining of nested loops. In Proceedings of the 8th International Parallel Processing Symposium. IEEE, 335–342.

RAU, B. R. 1994. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture. San Jose, CA, 63–74.

RAU, B. R. AND FISHER, J. A. 1993. Instruction-level parallel processing: History, overview and perspective. J. Supercomput. 7, 9–50.

RAU, B. R., LEE, M., TIRUMALAI, P. P., AND SCHLANSKER, M. S. 1992. Register allocation for modulo scheduled loops: Strategies, algorithms and heuristics. HP Labs Tech. Rep. HPL-92-48, Hewlett-Packard Laboratories, Palo Alto, CA.

RONG, H., DOUILLET, A., GOVINDARAJAN, R., AND GAO, G. R. 2004. Code generation for single-dimension software pipelining of multi-dimensional loops. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'04). IEEE Computer Society, 175–186.

RONG, H. AND GOVINDARAJAN, R. 2007. Advances in software pipelining. In The Compiler Design Handbook: Optimization and Machine Code Generation, 2nd Ed. Y. N. Srikant and P. Shankar, Eds. CRC, Chapter 20.

RONG, H., TANG, Z., GOVINDARAJAN, R., DOUILLET, A., AND GAO, G. R. 2007a. Single-dimension software pipelining for multi-dimensional loops. CAPSL Tech. Memo 049, Department of Electrical and Computer Engineering, University of Delaware, Newark, DE. ftp://ftp.capsl.udel.edu/pub/doc/memos/memo049.ps.gz.

RONG, H., TANG, Z., GOVINDARAJAN, R., DOUILLET, A., AND GAO, G. R. 2007b. Single-dimension software pipelining for multidimensional loops. ACM Trans. Architec. Code Optim. 4, 1, 7.

TURKINGTON, K., MASSELOS, K., CONSTANTINIDES, G. A., AND LEONG, P. 2006. FPGA based acceleration of the LINPACK benchmark: A high level code transformation approach. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). Madrid, Spain. IEEE, 1–6.

WANG, J. AND GAO, G. R. 1996. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops. In Proceedings of the 6th International Conference on Compiler Construction (CC'96). Springer-Verlag, London, UK, 1–17.