In the context of multidimensional software pipelining, vector lifetimes with complex shapes present unusual challenges to register allocation. This article presents a systematic solution to these challenges. The essential problems of lifetime normalization and representation are addressed to precisely abstract the lifetimes. Conservative and aggressive distances are proposed to guide conflict-free bin packing and circumference minimization under multiple fit strategies. The method subsumes classical register allocation for software-pipelined single loops as a special case. Experiments indicate that the first-fit strategy aided by the aggressive distance effectively minimizes register usage with insignificant compile time.
The software pipelining technique and its register allocation have their strengths and limitations. The case study in Section 9 suggested several limiting factors inherent in the loops. Some of them could be overcome by combining SSP with hierarchical reduction. We also need to improve the register allocation approach with spilling and splitting. The performance of SSP then requires extensive evaluation. Combining SSP with high-level loop transformations like tiling, and with hardware pipeline designs like FPGA-based reconfigurable computing, so that all levels of parallelism are exploited, is another important future direction. Finally, we are currently extending SSP to many-core architectures to exploit both thread-level and instruction-level parallelism. With the paradigm shift from uniprocessors to many-core architectures, we envision the combination of software pipelining and multithreading as an inevitable trend.
APPENDIXES
APPENDIX A: FORMAL DEFINITIONS FOR LIFETIME REPRESENTATION

From Appendix A.1 to A.5, we introduce lifetime representation for the loop and kernel models in Figures 5(a) and 5(b), where l1 = l2 = · · · = ln. Then in Appendix A.6, we extend the lifetime representation to the more general loop nest and kernel model in Figures 20(a) and 20(b), where l1, l2, . . ., and ln are not necessarily equal. In both cases, the kernel has a single II of T cycles. We do not need to study the multi-II case because it can be treated as the single-II case, as discussed in Section 6.4.
For a variable, the core and derived parameters are defined from the references set REFS. For convenience, we refer to the elements of a reference ref as time(ref), type(ref) (USE or DEF), omega(ref), and level(ref), respectively. Also let latency(ref) be the latency of the operation that contains the reference (as an operand).
A.1 SingleStart, SingleEnd, Omega, and Alpha
singleStart = min time(r), ∀r ∈ REFS s.t. type(r) = DEF
singleEnd = max time(r) + latency(r) + omega(r) ∗ T, ∀r ∈ REFS
omega = max omega(r), ∀r ∈ REFS
SingleStart and singleEnd are equivalent to the start and end times used in traditional register allocation for single loops [Rau et al. 1992], described in Section 2.2. In defining singleEnd, not only uses but also definitions must be considered. In case a variable is defined but not used, the end time of that variable should be the completion time of the last definition.
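As a sanity check, the three definitions above can be computed directly from the reference set. The sketch below is ours, not the paper's implementation; the `Ref` tuple is a hypothetical stand-in for an element of REFS.

```python
from collections import namedtuple

# Hypothetical stand-in for an element of REFS.
Ref = namedtuple("Ref", ["time", "type", "omega", "latency"])

def core_params(refs, T):
    """Compute (singleStart, singleEnd, omega) for one variable."""
    single_start = min(r.time for r in refs if r.type == "DEF")
    # Definitions count too: a variable defined but never used ends
    # when its last definition completes.
    single_end = max(r.time + r.latency + r.omega * T for r in refs)
    omega = max(r.omega for r in refs)
    return single_start, single_end, omega

# A DEF at cycle 2 (latency 1), used at cycle 5 (latency 1) by the
# next outermost iteration (omega = 1), with T = 4:
refs = [Ref(2, "DEF", 0, 1), Ref(5, "USE", 1, 1)]
print(core_params(refs, T=4))  # (2, 10, 1)
```

Here singleEnd = 5 + 1 + 1 ∗ 4 = 10 comes from the use, which completes later than the definition.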
In a healthy program, only a reference at the outermost loop level, which is software pipelined, can have a nonzero omega. The loop nest in Figure 3(a) is a typical example. Also, omega of a definition is always 0.
Alpha depends on the code outside the loop nest and cannot be computed from the references. We assume it has already been given by an earlier phase of the compiler.
A.2 Start and End
Given a loop level i ∈ [1, n], if level i has no definition in REFS, we set (start[i], end[i]) = (+∞, −∞). Otherwise, let the definition be d ∈ REFS; then

start[i] = time(d)
end[i] = max(time(d) + latency(d),
             time(u) + latency(u) + loopBackOffset(d, u), ∀u ∈ USES(d)),

where USES(d) is the set of all possible uses of d, that is, all the uses that can be reached by d in the control flow graph, and loopBackOffset(d, u) is an extra offset when d reaches a use u along a back edge of a loop.
In terms of dataflow analysis, a point x in the control flow graph reaches another point y if there is a path from x to y without any definition of the variable under consideration between them. The control flow graph in this article is the graph for the original loop nest (before software pipelining) defined in Figure 5(a), which contains n back edges, one for each loop.
loopBackOffset(x, y) is the offset when point x reaches point y along a back edge of a loop, either the outermost loop or any of the inner loops. It is defined as

loopBackOffset(x, y) =
  omega(y) ∗ T   if x reaches y via the back edge of the outermost loop,
  max(Sj ∗ T)    if x reaches y via the back edge of any inner loop Lj (2 ≤ j ≤ n),
  0              otherwise.
For the second case, where x reaches y via the back edge of an inner loop Lj, since the inner loop runs sequentially, the offset due to this loop is Sj ∗ T. For example, in Figure 11(d), operation c defines TNx; this definition reaches two uses, one a source operand of operation b, and the other a source operand of operation c. Take the latter use, for instance. The definition can reach it along the back edge of either loop L2 (via path 2) or loop L3 (via path 3), and therefore the offset equals max(S2 ∗ T, S3 ∗ T) = S2 ∗ T.
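The case analysis above can be sketched as follows. This is our illustrative reading: `via_edges`, the set of back-edge labels a definition may follow to reach a use, is a hypothetical abstraction of the reaching-definition analysis, and the stage counts S2 = 3, S3 = 2 are made-up values.

```python
def loop_back_offset(via_edges, omega_y, S, T):
    """Offset added when a definition reaches a use along back edges.
    via_edges: labels of the back edges followed ('outer' for the
    outermost loop, or an integer j for inner loop L_j).
    S: maps an inner loop level j to its stage count S_j."""
    if "outer" in via_edges:
        return omega_y * T                    # outermost back edge
    inner = [j for j in via_edges if j != "outer"]
    if inner:
        return max(S[j] for j in inner) * T   # max(S_j * T) over inner loops
    return 0                                  # no back edge followed

# The TNx example: reachable via the back edges of L2 and L3,
# with (hypothetical) S2 = 3, S3 = 2, T = 4: max(S2*T, S3*T) = 12.
print(loop_back_offset({2, 3}, 0, {2: 3, 3: 2}, T=4))  # 12
```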
A.3 NextStart
Given a loop level i ∈ [1, n], let x be a point at this level. If level i has a definition d ∈ REFS, let x = d. Otherwise, let x be the starting point of loop Li, that is, x is at the first cycle of the first stage of Li. Then

nextStart[i] =
  min{ time(d) + loopBackOffset(x, d), ∀d ∈ NEXTDEFS(x) }   if NEXTDEFS(x) is not empty,
  +∞                                                         otherwise,
where NEXTDEFS(x) is the set of all possible next definitions of the variable in the same iteration as point x. The back edge of the outermost loop should not be followed in calculating this set; otherwise, the set may contain a definition from the next iteration. Formally,

NEXTDEFS(x) = { y | y ∈ REFS and type(y) = DEF and x can reach y without following the back edge of the outermost loop }.
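A direct transcription of this definition, in our sketch form: NEXTDEFS(x) is assumed to be precomputed by the reaching-definition analysis and passed in as (time, loop-back offset) pairs.

```python
import math

def next_start(next_defs):
    """nextStart for one loop level.
    next_defs: list of (time(d), loopBackOffset(x, d)) pairs,
    one per d in NEXTDEFS(x). Empty set -> +infinity."""
    if not next_defs:
        return math.inf
    return min(t + off for t, off in next_defs)

print(next_start([]))                # inf
print(next_start([(7, 0), (3, 8)]))  # 7
```

In the second call, the definition at time 3 is only reachable along a back edge with offset 8, so the earlier plain definition at time 7 wins.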
A.4 FirstStretch and LastStretch
Take the first ILES as a point of reference. Let x ≥ 0 be the iteration index of a stretched interval at loop level i. The interval is stretched if it is defined before the first cycle of the first ILES, but its last use reaches or goes beyond that cycle, and it is not within the first group of iterations. For example, in Figure 4, the interval of TN2 defined by operation b3 spans across the first ILES, and the use c3,0 is in the second group. Formally, the conditions are:

start[i] + x ∗ T < ln ∗ T,
end[i] + x ∗ T > ln ∗ T,
x + omega ≥ Sn when i = 1,
x ≥ Sn when i > 1,
and x ≥ 0.
Let first and last be the smallest and largest solutions of x to the prior inequalities. That is,

first = min over ∀i ∈ [1, n] of max( 0, Sn − span, ln − ⌊(end[i] − 1)/T⌋ ),
last = max over ∀i ∈ [1, n] of ( ln − ⌈(start[i] + 1)/T⌉ ),

where span = omega if i = 1, or span = 0 otherwise.
Note that when n = 1 or first > last, there is no stretched interval at all. Thus we define

(firstStretch, lastStretch) =
  (+∞, −∞)       if n = 1 or first > last,
  (first, last)  otherwise.
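Under our reading of the typeset formulas (the floor/ceiling placement follows from solving the inequalities for integer x, and is an assumption on our part), the bounds can be sketched as below. Here start and end cover only the levels that actually have a definition, and n, ln, T, Sn, omega are the model parameters.

```python
import math

def stretch_bounds(start, end, n, ln, T, Sn, omega):
    """Smallest/largest stretched-iteration indexes, or (inf, -inf)
    when no interval is stretched.
    start, end: {level: cycle} for levels with a definition."""
    firsts, lasts = [], []
    for i in start:
        span = omega if i == 1 else 0
        # smallest x with end[i] + x*T > ln*T, x >= Sn - span, x >= 0
        firsts.append(max(0, Sn - span, ln - (end[i] - 1) // T))
        # largest x with start[i] + x*T < ln*T
        lasts.append(ln - math.ceil((start[i] + 1) / T))
    first, last = min(firsts), max(lasts)
    if n == 1 or first > last:
        return (math.inf, -math.inf)
    return (first, last)

# One definition at level 2 with start = 2, end = 10, and
# ln = 4, T = 4, Sn = 2, omega = 0: the inequalities give x in {2, 3}.
print(stretch_bounds({2: 2}, {2: 10}, n=2, ln=4, T=4, Sn=2, omega=0))  # (2, 3)
```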
A.5 Top and Bottom
Again take the first ILES as a point of reference. If outermostIntervalOnly is true, the segment is made of only stretched intervals, if any. Otherwise, there are Sn scalar lifetimes from the first group of iterations, followed by the stretched intervals, if any. For example, in Figure 6, TN1 has only a stretched interval in the first ILES, while TN2 has Sn = 3 scalar lifetimes and two stretched intervals. Therefore,
(top, bottom) =
  (firstStretch, lastStretch + 1)   if outermostIntervalOnly,
  (0, max(Sn, lastStretch + 1))     otherwise.
Note that if outermostIntervalOnly is true and no interval is stretched (i.e., (firstStretch, lastStretch) = (+∞, −∞)), the ILES has no interval in it. In this case, according to the formulas, (top, bottom) = (+∞, −∞), as expected.
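A sketch of this case split, ours rather than the paper's code. Conveniently, IEEE infinities reproduce the (+∞, −∞) degenerate case automatically, since -inf + 1 is still -inf.

```python
import math

def top_bottom(first_stretch, last_stretch, Sn, outermost_interval_only):
    """(top, bottom) of the first ILES from the stretch bounds."""
    if outermost_interval_only:
        return (first_stretch, last_stretch + 1)
    return (0, max(Sn, last_stretch + 1))

print(top_bottom(1, 2, 3, True))                 # (1, 3)
print(top_bottom(1, 2, 3, False))                # (0, 3)
# No stretched interval at all: the degenerate pair propagates.
print(top_bottom(math.inf, -math.inf, 3, True))  # (inf, -inf)
```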
A.6 Extensions
This section describes in more detail the parameters under the more general loop nest and kernel model in Figures 20(a) and 20(b).
A.6.1 Core Parameters. Since a loop level is allowed to have two intervals, start, end, and nextStart are extended to two-dimensional arrays. The interval defined at PosA (PosB) at loop level i, and the hole following it, are described by start[PosA][i], end[PosA][i], and nextStart[PosA][i] (start[PosB][i], end[PosB][i], and nextStart[PosB][i]). The meaning of these parameters is exactly the same as that of the start[i], end[i], and nextStart[i] introduced in Appendixes A.2 and A.3, except that the interval starts at the specific position. When i = n, we set start[PosA][i], end[PosA][i], and nextStart[PosA][i] to be the same as start[PosB][i], end[PosB][i], and nextStart[PosB][i], respectively.
One difficulty is the issue time of the definition for computing start[PosB][i]. If i < n, this definition is after the loops Ln, Ln−1, . . ., and Li+1, and thus its issue time depends on the total execution time of these loops, which may be unknown at compile time. Similar problems exist for end[PosB][i] and nextStart[PosB][i].
Fortunately, it is not necessary to calculate the absolute issue time; a relative time is sufficient for our method. Note that the purpose of representing an interval IA of a vector lifetime A is to compute INTL[A, B], where IA is interleaved with the corresponding interval IB of a vector lifetime B. Look at formulas (10) and (11). What we need is only some relative differences, for example, end(IA) − nextStart(IB). Since both IA and IB are from the same position of the same loop level, it is sufficient to keep the differences if every issue time is expressed relative to the same reference point.
For this reason, we choose to imagine that each of the loops before the interval, Ln, Ln−1, . . ., and Li+1, has one iteration (i.e., Nn = Nn−1 = · · · = Ni+1 = 1). Therefore, the issue time of the definition of the interval is just its 1-D schedule time. Similarly, the issue time of any use of the definition of the interval, or of any definition following the interval, is also the 1-D schedule time, except that there may be loop-back offsets as described before.
A.6.2 Derived Parameters. Derived parameters have two more elements to describe the extended intervals: firstExtend and lastExtend. There are also minor changes in other parameters, described in the following.

Stretching happens during the filling of an iteration, and thus only for the intervals defined at PosA. For the formulas in Appendix A.4, we need to change start[i] and end[i] to start[PosA][i] and end[PosA][i], respectively.
In contrast, extended intervals appear during the draining of an iteration, and thus only for the intervals defined at PosB. Now we show how to calculate firstExtend and lastExtend. As before, we take the first ILES as the reference. Imagine that before the first iteration (iteration 0), there can be other iterations with negative indexes. Let x < 0 be the iteration index of an extended interval at loop level i. The interval is extended if its definition is before the first ILES, but its last use reaches to or beyond the first cycle of that segment. More formally,
start[PosB][i] + x ∗ T < ln ∗ T,
end[PosB][i] + x ∗ T > ln ∗ T.

As the direct solutions to these inequalities, let

firstEx = min over ∀i ∈ [1, n] of ( ln − ⌊(end[PosB][i] − 1)/T⌋ ),
lastEx = max over ∀i ∈ [1, n] of min( −1, ln − ⌈(start[PosB][i] + 1)/T⌉ ).
Note that when n = 1 or firstEx > lastEx, there is no extended interval at all. Therefore we define

(firstExtend, lastExtend) =
  (+∞, −∞)           if n = 1 or firstEx > lastEx,
  (firstEx, lastEx)  otherwise.
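The extended-interval bounds mirror the stretched-interval ones. As before, the floor/ceiling placement is our reconstruction from the inequalities, not a quotation of the paper; startB and endB hold the PosB intervals per level.

```python
import math

def extend_bounds(startB, endB, n, ln, T):
    """(firstExtend, lastExtend) for intervals defined at PosB, or
    (inf, -inf) when nothing is extended. Iteration indexes x < 0.
    startB, endB: {level: cycle} for levels with a PosB definition."""
    # smallest x with endB[i] + x*T > ln*T
    firsts = [ln - (endB[i] - 1) // T for i in endB]
    # largest x with startB[i] + x*T < ln*T, capped at -1 since x < 0
    lasts = [min(-1, ln - math.ceil((startB[i] + 1) / T)) for i in startB]
    first_ex, last_ex = min(firsts), max(lasts)
    if n == 1 or first_ex > last_ex:
        return (math.inf, -math.inf)
    return (first_ex, last_ex)

# start = 14, end = 22 at level 2, with ln = 4, T = 4:
# the only solution of the inequalities with x < 0 is x = -1.
print(extend_bounds({2: 14}, {2: 22}, n=2, ln=4, T=4))  # (-1, -1)
```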
Next, the top and bottom of the first ILES may change in the new situation. Let oldTop and oldBottom be the top and bottom defined in Appendix A.5, which consider the stretched intervals only. For the current situation, we need to further consider the extended intervals as well:

top = min(oldTop, firstExtend),
bottom = max(oldBottom, lastExtend + 1).

APPENDIX B: INITIALIZATION OF LOOP CONTROL REGISTERS IN CODE GENERATION
The control registers, LC and EC, must be initialized correctly. A wrong setting can make the loop L1 run infinitely, because the operation br.ctop L1 before the epilog might never have the right condition to fall through to the epilog.
LC and EC are first initialized at the beginning of the schedule. We assert that LC and EC must have already reached 0 before the epilog. In order to drain the remaining operation instances, we reinitialize EC such that after the epilog is finished, both LC and EC are 0 again. We show how to find the correct initial values of the two registers in both cases.
First, we discuss how to initialize LC and EC at the beginning of the schedule. Take Figure 33 as an example. Note the operation br.ctop L1 before the epilog. It is executed after an ILES. If this is the last ILES, then right before this branch operation, LC should be 0 and EC should be 1. In this condition, after the branch, both LC and EC become 0, and the branch falls through to the epilog. We assert the conditions before and after this branch for clarity.
From the conditions, we can derive the initial setting of the loop control registers. There are N1 iterations in total. Therefore, LC is initialized to

LC = N1 − 1.
Assume EC has an initial value of x. We consider how the (LC, EC) pair changes. One br.ctop changes it to (LC − 1, EC) when LC is not 0, or to (LC, EC − 1) otherwise. Therefore, in the whole process, the pair changes from (N1 − 1, x) to (0, x) after N1 − 1 br.ctop operations, and then from (0, x) to (0, 0) after x more br.ctop operations. Thus the total number of br.ctop operations that have been executed is N1 − 1 + x.
On the other hand, we have ln − Sn br.ctop operations in the prolog, and ⌈N1/Sn⌉ OLPs, with each OLP taking Sn br.ctop operations, including the one after the corresponding ILES. Therefore, the total number of br.ctop operations that have been executed before the epilog can also be stated as ln − Sn + ⌈N1/Sn⌉ ∗ Sn.
From the equality, we can easily find that EC should be initialized to

x = ln − ((N1 − 1) mod Sn).
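The counting argument can be checked by simulating br.ctop with the semantics stated above: (LC, EC) becomes (LC − 1, EC) when LC ≠ 0, else (LC, EC − 1). The ⌈N1/Sn⌉ OLP count is our reading of the total; this is an illustrative check, not the paper's code generator.

```python
import math

def br_ctop(lc, ec):
    """One br.ctop step on the (LC, EC) pair."""
    return (lc - 1, ec) if lc != 0 else (lc, ec - 1)

def before_epilog(N1, Sn, ln):
    """Apply every br.ctop executed before the epilog, starting from
    LC = N1 - 1 and EC = ln - ((N1 - 1) mod Sn)."""
    lc, ec = N1 - 1, ln - ((N1 - 1) % Sn)
    total = (ln - Sn) + math.ceil(N1 / Sn) * Sn  # prolog + OLP br.ctops
    for _ in range(total):
        lc, ec = br_ctop(lc, ec)
    return lc, ec

# Both registers should be back to 0 just before the epilog.
for N1, Sn, ln in [(7, 3, 5), (9, 3, 5), (10, 4, 6)]:
    assert before_epilog(N1, Sn, ln) == (0, 0)
print("ok")  # ok
```

The loop decrements LC exactly N1 − 1 times and EC exactly x times, so the equality N1 − 1 + x = ln − Sn + ⌈N1/Sn⌉ ∗ Sn is what the final (0, 0) state certifies.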
This formula generalizes the one in our previous publication [Rong et al. 2004], where the assumption was l1 = l2 = · · · = ln, while here they are not necessarily equal.
Now let us consider the initialization for the epilog. Before the initialization, both LC and EC are 0. Since there is no new iteration, LC should remain 0, while EC should enable the remaining stages to drain. In general, there are l1 − fn + 1 partial kernels in the epilog. After each of the partial kernels, we need a br.ctop operation to rotate registers. Therefore, EC should be initialized to

EC = l1 − fn + 1.
ACKNOWLEDGMENTS
We are grateful to the anonymous reviewers, and to Chan Sun, Ross Towle, Shuxin Yang, Jose Nelson Amaral, R. Govindarajan, Jean C. Beyler, and Michel Strasser for their valuable comments.
REFERENCES
AIKEN, A., NICOLAU, A., AND NOVACK, S. 1995. Resource-constrained software pipelining. IEEE Trans. Parall. Distrib. Syst. 6, 12, 1248–1270.
ALLAN, V. H., JONES, R. B., LEE, R. M., AND ALLAN, S. J. 1995. Software pipelining. ACM Comput. Surv. 27, 3, 367–432.
ALLEN, J. R., KENNEDY, K., PORTERFIELD, C., AND WARREN, J. 1983. Conversion of control dependence to data dependence. In Proceedings of the 10th Annual ACM Symposium on Principles of Programming Languages. 177–189.
AUSLANDER, M. AND HOPKINS, M. 2004. An overview of the PL.8 compiler. SIGPLAN Notices 39, 4, 38–48.
CALLAHAN, D. AND KOBLENZ, B. 1991. Register allocation via hierarchical graph coloring. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI'91). ACM Press, 192–203.
CARR, S., DING, C., AND SWEANY, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS'96), Software Technology and Architecture, vol. 1. IEEE Computer Society, 183.
CHAITIN, G. 2004. Register allocation and spilling via graph coloring. SIGPLAN Notices 39, 4, 66–74.
CHENG, W.-K. AND LIN, Y.-L. 1999. Code generation of nested loops for DSP processors with heterogeneous registers and structural pipelining. ACM Trans. Des. Autom. Electro. Syst. 4, 3, 231–256.
CYTRON, R., FERRANTE, J., ROSEN, B. K., WEGMAN, M. N., AND ZADECK, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451–490.
DARTE, A. AND ROBERT, Y. 1994. Constructive methods for scheduling uniform loop nests. IEEE Trans. Parall. Distrib. Syst. 5, 8, 814–822.
DEHNERT, J. C. AND TOWLE, R. A. 1993. Compiling for the Cydra 5. J. Supercomput. 7, 1-2, 181–227.
DOUILLET, A. AND GAO, G. R. 2005. Register pressure in software-pipelined loop nests: Fast computation and impact on architecture design. In The 18th International Workshop on Languages and Compilers for Parallel Computing (LCPC'05). Hawthorne, NY, 17–31.
DOUILLET, A. AND GAO, G. R. 2007. Software-pipelining on multi-core architectures. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007). IEEE Computer Society, 39–48.
EBCIOGLU, K. AND NAKATANI, T. 1990. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing. Pitman Publishing, London, UK, 213–229.
GAO, G. R., NING, Q., AND DONGEN, V. V. 1993. Software pipelining for nested loops. ACAPS Tech. Memo 53, School of Computer Science, McGill Univ., Montréal, Québec.
HENDREN, L. J., GAO, G. R., ALTMAN, E. R., AND MUKERJI, C. 1992. A register allocation framework based on hierarchical cyclic interval graphs. In Proceedings of the 4th International Conference on Compiler Construction (CC'92). Springer-Verlag, 176–191.
HUFF, R. A. 1993. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI'93). Albuquerque, 258–267.
INTEL. 2001. Intel IA-64 Architecture Software Developer's Manual. Vol. 1: IA-64 Application Architecture. Intel Corporation, Santa Clara, CA.
LAM, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI'88). 318–328.
LAMPORT, L. 1974. The parallel execution of DO loops. Comm. ACM 17, 2, 83–93.
LAWLER, E. L., LENSTRA, J. K., RINNOOY KAN, A. H. G., AND SHMOYS, D. B. 1985. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley & Sons.
MOON, S.-M. AND EBCIOĞLU, K. 1997. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Trans. Program. Lang. Syst. 19, 6, 853–898.
MUTHUKUMAR, K. AND DOSHI, G. 2001. Software pipelining of nested loops. Lecture Notes in Computer Science, vol. 2027, 165–181.
RAMANUJAM, J. 1994. Optimal software pipelining of nested loops. In Proceedings of the 8th International Parallel Processing Symposium. IEEE, 335–342.
RAU, B. R. 1994. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture. San Jose, CA, 63–74.
RAU, B. R. AND FISHER, J. A. 1993. Instruction-level parallel processing: History, overview and perspective. J. Supercomput. 7, 9–50.
RAU, B. R., LEE, M., TIRUMALAI, P. P., AND SCHLANSKER, M. S. 1992. Register allocation for modulo scheduled loops: Strategies, algorithms and heuristics. HP Labs Tech. Rep. HPL-92-48, Hewlett-Packard Laboratories, Palo Alto, CA.
RONG, H., DOUILLET, A., GOVINDARAJAN, R., AND GAO, G. R. 2004. Code generation for single-dimension software pipelining of multi-dimensional loops. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'04). IEEE Computer Society, 175–186.
RONG, H. AND GOVINDARAJAN, R. 2007. Advances in software pipelining. In The Compiler Design Handbook: Optimization and Machine Code Generation, 2nd Ed. Y. N. Srikant and P. Shankar, Eds. CRC, Chapter 20.
RONG, H., TANG, Z., GOVINDARAJAN, R., DOUILLET, A., AND GAO, G. R. 2007a. Single-dimension software pipelining for multi-dimensional loops. Tech. Memo 49, Department of Electrical and Computer Engineering, University of Delaware, Newark, DE. ftp://ftp.capsl.udel.edu/pub/doc/memos/memo049.ps.gz.
RONG, H., TANG, Z., GOVINDARAJAN, R., DOUILLET, A., AND GAO, G. R. 2007b. Single-dimension software pipelining for multidimensional loops. ACM Trans. Architec. Code Optim. 4, 1, 7.
TURKINGTON, K., MASSELOS, K., CONSTANTINIDES, G. A., AND LEONG, P. 2006. FPGA based acceleration of the LINPACK benchmark: A high level code transformation approach. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). Madrid, Spain. IEEE, 1–6.
WANG, J. AND GAO, G. R. 1996. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops. In Proceedings of the 6th International Conference on Compiler Construction (CC'96). Springer-Verlag, London, UK, 1–17.