11.5.6 Loop Unrolling Prior to Software Pipelining
In some cases, higher performance can be achieved by unrolling the loop prior to software pipelining. Loops that are resource constrained can be improved by unrolling so that the limiting resource is more fully utilized. In the following example, assuming the target processor has only two memory units, loop performance is bound by the number of memory units:
L1:     ld4 r4 = [r5],4                 // Cycle 0
        ld4 r9 = [r8],4 ;;              // Cycle 0
        add r7 = r4,r9 ;;               // Cycle 2
        st4 [r6] = r7,4                 // Cycle 3
        br.cloop L1 ;;                  // Cycle 3
A pipelined version of this loop must have an II of at least two because there are three memory instructions but only two memory units. If the loop is unrolled by a factor of two prior to software pipelining, and assuming the store is independent of the loads, the new loop body contains six memory instructions, so an II of 3 can be achieved for the new loop. This is an effective II of 1.5 for the original source loop. Below is a possible pipeline for the unrolled loop:
stage 1: (p16) ld4 r4 = [r5],8          // odd iteration
         (p16) ld4 r9 = [r8],8 ;;       // odd iteration
stage 2: (p16) ld4 r14 = [r15],8        // even iteration
         (p16) ld4 r19 = [r18],8 ;;     // even iteration
                                        // --- empty cycle
stage 3: (p18) add r7 = r4,r9           // odd iteration
         (p17) add r17 = r14,r19 ;;     // even iteration
stage 4:                                // --- empty cycle
         (p19) st4 [r6] = r7,8          // odd iteration
         (p18) st4 [r16] = r17,8 ;;     // even iteration
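The II figures quoted above follow from simple resource arithmetic. The Python sketch below is only an illustration of that arithmetic, not part of any toolchain: the function names `resource_min_ii` and `effective_ii` are hypothetical, and the model considers memory units alone, ignoring dependences and every other resource.

```python
from fractions import Fraction


def resource_min_ii(mem_ops_per_iter, mem_units, unroll=1):
    """Lower bound on II (in cycles) imposed by the memory units alone.

    The unrolled body holds mem_ops_per_iter * unroll memory instructions,
    and at most mem_units of them can issue per cycle.
    """
    total_ops = mem_ops_per_iter * unroll
    return -(-total_ops // mem_units)  # ceiling division on integers


def effective_ii(mem_ops_per_iter, mem_units, unroll=1):
    """Cycles per original source iteration at that resource bound."""
    return Fraction(resource_min_ii(mem_ops_per_iter, mem_units, unroll), unroll)


if __name__ == "__main__":
    # Example loop: two loads and one store per source iteration, two memory units.
    for unroll in (1, 2):
        print(unroll, resource_min_ii(3, 2, unroll), effective_ii(3, 2, unroll))
```

With three memory instructions and two memory units, the bound is 2 cycles without unrolling and 3 cycles for the loop unrolled by two, which is 3/2 cycles per source iteration.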
The unrolled loop contains two copies of the source loop body, one that corresponds to the odd source iterations and one that corresponds to the even source iterations. The assignment of stage predicates must take this into account. Recall that each 1 written to p16 sequentially enables all the stages for a new source iteration. During stage one of the above pipeline, the stage predicate for the odd iteration is in p16. The stage predicate for the even iteration does not exist yet. During stage two of the above pipeline, the stage predicate for the odd iteration is in p17 and the new stage predicate for the even iteration is in p16. Thus within the same pipeline stage, if the stage predicate for the odd iteration is in predicate register X, the stage predicate for the even iteration is in predicate register X-1. The pseudo-code to implement this pipeline assuming an unknown trip count is shown below:
        add r15 = r5,4                  // even-iteration load pointers
        add r18 = r8,4
        add r16 = r6,4                  // even-iteration store pointer
        mov lc = r2                     // LC = loop count - 1
        mov ec = 4                      // EC = epilog stages + 1
        mov pr.rot=1<<16 ;;             // PR16 = 1, rest = 0
L1:
(p16)   ld4 r33 = [r5],8                // Cycle 0 odd iteration
(p18)   add r39 = r35,r38               // Cycle 0 odd iteration
(p17)   add r38 = r34,r37               // Cycle 0 even iteration
(p16)   ld4 r36 = [r8],8                // Cycle 0 odd iteration
        br.cexit.spnt L3 ;;             // Cycle 0
(p16)   ld4 r33 = [r15],8               // Cycle 1 even iteration
(p16)   ld4 r36 = [r18],8 ;;            // Cycle 1 even iteration
(p19)   st4 [r6] = r40,8                // Cycle 2 odd iteration
(p18)   st4 [r16] = r39,8               // Cycle 2 even iteration
L2:     br.ctop.sptk L1 ;;              // Cycle 2
L3:
Notice that the stages are not equal in length: stages 1 and 3 are one cycle each, and stages 2 and 4 are two cycles each. The length of the epilog phase also varies with the trip count. If the trip count is odd, there are three epilog stages, starting after the br.cexit and ending at the br.ctop. If the trip count is even, there are two epilog stages, starting after the br.ctop and ending at the br.ctop. EC must be set to cover the maximum number of epilog stages, so for this example EC is initialized to four. When the trip count is even, one extra epilog stage is executed and the br.cexit branch to L3 is taken. All of the stage predicates used during the extra epilog stage are equal to 0, so nothing is executed.
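The epilog behavior described above can be traced with a small model of the loop counters and rotating predicates. The Python sketch below is a simplified illustration under stated assumptions: `epilog_profile` is a hypothetical name, only LC, EC and p16 to p63 are modeled, and both loop branches are assumed to share the same update (decrement LC and write 1 to p63 while LC is nonzero, then decrement EC and write 0 until EC reaches 1, at which point the loop is left).

```python
def epilog_profile(trip_count, ec=4):
    """Trace the EC = 4 kernel and describe its epilog.

    Returns (epilog_stages, idle_tail_stages, exit_branch), where
    idle_tail_stages counts trailing epilog stages in which every
    stage predicate used by that stage is 0.
    """
    lc = trip_count - 1                  # LC = loop count - 1
    pr = [1] + [0] * 47                  # pr[i] models rotating predicate p(16+i)
    # Predicates read by each kernel stage, as indices into pr.
    uses = {"A": (0, 1, 2),              # cycle 0: p16 loads, p17 and p18 adds
            "B": (0, 2, 3)}              # cycles 1-2: p16 loads, p18 and p19 stores
    branch_after = {"A": "br.cexit", "B": "br.ctop"}
    activity = []                        # one flag per epilog stage
    in_epilog = False
    stage = "A"
    while True:
        if in_epilog:
            activity.append(any(pr[i] for i in uses[stage]))
        # Counted-loop branch that ends the current stage.
        if lc > 0:
            lc -= 1
            p63 = 1                      # enables a new source iteration
        elif ec > 1:
            ec -= 1
            p63 = 0
            in_epilog = True
        else:
            break                        # EC == 1: the loop is exited here
        pr.insert(0, p63)                # rotation: the value written to p63
        pr.pop()                         # becomes visible as p16
        stage = "B" if stage == "A" else "A"
    idle_tail = 0
    for active in reversed(activity):
        if active:
            break
        idle_tail += 1
    return len(activity), idle_tail, branch_after[stage]


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, epilog_profile(n))
```

In this model an odd trip count leaves the loop at br.ctop after three epilog stages with no idle tail, while an even trip count leaves at br.cexit after three stages of which the last does no work.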
The extra epilog stage for even trip counts can be eliminated by setting the target of the br.cexit branch to the next sequential bundle and initializing EC to three as shown below:
        add r15 = r5,4                  // even-iteration load pointers
        add r18 = r8,4
        add r16 = r6,4                  // even-iteration store pointer
        mov lc = r2                     // LC = loop count - 1
        mov ec = 3                      // EC = epilog stages for an even trip count + 1
        mov pr.rot=1<<16 ;;             // PR16 = 1, rest = 0
L1:
(p16)   ld4 r33 = [r5],8                // Cycle 0 odd iteration
(p18)   add r39 = r35,r38               // Cycle 0 odd iteration
(p17)   add r38 = r34,r37               // Cycle 0 even iteration
(p16)   ld4 r36 = [r8],8                // Cycle 0 odd iteration
        br.cexit.spnt L4 ;;             // Cycle 0
L4:
(p16)   ld4 r33 = [r15],8               // Cycle 1 even iteration
(p16)   ld4 r36 = [r18],8 ;;            // Cycle 1 even iteration
(p19)   st4 [r6] = r40,8                // Cycle 2 odd iteration
(p18)   st4 [r16] = r39,8               // Cycle 2 even iteration
L2:     br.ctop.sptk L1 ;;              // Cycle 2
L3:
If the loop trip count is even, two epilog stages are executed and the kernel loop is exited at the br.ctop. If the trip count is odd, the first two epilog stages are executed and then the br.cexit branch is taken. Because the target of the br.cexit branch is the next sequential bundle (L4), a third epilog stage is executed before the kernel loop is exited at the br.ctop. This optimization saves one stage at the end of the loop when the trip count is even, and is beneficial for short trip count loops.
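As a cross-check of the EC = 3 kernel, the Python sketch below simulates it functionally and confirms that every source iteration stores the sum of its two loaded values. It is a rough model under several assumptions: `run_unrolled_kernel` is a hypothetical name, memory is indexed in 4-byte words (so a post-increment of 8 bytes advances an index by 2), only 16 rotating general registers are modeled, the even-iteration store pointer r16 is assumed to start at r6 + 4, and latencies are ignored.

```python
def run_unrolled_kernel(a, b, ec=3):
    """Model the EC = 3 kernel; return (c, kernel_stages_executed)."""
    n = len(a)
    assert n >= 1 and len(b) == n
    c = [None] * n
    gr = [0] * 16            # gr[i] models rotating register r(32+i)
    pr = [1] + [0] * 47      # pr[i] models rotating predicate p(16+i)
    lc = n - 1               # LC = loop count - 1
    r5 = r8 = r6 = 0         # odd-iteration pointers, as word indices
    r15 = r18 = r16 = 1      # even-iteration pointers (base + 4 bytes)
    stages = 0

    def loop_branch():
        """LC/EC update shared by br.ctop and br.cexit; True = keep looping."""
        nonlocal lc, ec
        if lc > 0:
            lc, p63, keep = lc - 1, 1, True
        elif ec > 1:
            ec, p63, keep = ec - 1, 0, True
        elif ec == 1:
            ec, p63, keep = 0, 0, False
        else:                # LC == 0 and EC == 0: no rotation takes place
            return False
        pr.insert(0, p63)    # rotate predicates; the new value lands in p16
        pr.pop()
        gr.insert(0, gr.pop())  # rotate general registers by one
        return keep

    while True:
        # Cycle 0 is one instruction group: read all sources before writing.
        odd_sum = gr[3] + gr[6]           # r35 + r38
        even_sum = gr[2] + gr[5]          # r34 + r37
        if pr[0]:                         # (p16) odd loads into r33 and r36
            gr[1], gr[4] = a[r5], b[r8]
            r5, r8 = r5 + 2, r8 + 2
        if pr[2]:                         # (p18) add r39 = r35,r38
            gr[7] = odd_sum
        if pr[1]:                         # (p17) add r38 = r34,r37
            gr[6] = even_sum
        stages += 1
        loop_branch()                     # br.cexit L4: both outcomes reach L4
        # Cycles 1 and 2.
        if pr[0]:                         # (p16) even loads into r33 and r36
            gr[1], gr[4] = a[r15], b[r18]
            r15, r18 = r15 + 2, r18 + 2
        if pr[3]:                         # (p19) st4 [r6] = r40,8
            c[r6] = gr[8]
            r6 += 2
        if pr[2]:                         # (p18) st4 [r16] = r39,8
            c[r16] = gr[7]
            r16 += 2
        stages += 1
        if not loop_branch():             # br.ctop falls through to L3
            return c, stages


if __name__ == "__main__":
    print(run_unrolled_kernel([1, 2, 3], [10, 20, 30]))
```

Because the br.cexit target is the next bundle, the model ignores the outcome of that branch and leaves the loop only at br.ctop, which is why trip counts of one and two both complete in four kernel stages.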
Although unrolling can be beneficial, a few points should be considered before unrolling and software pipelining a loop. Unrolling reduces the trip count of the loop that is given to the pipeliner, and thus may make pipelining of the loop undesirable, since low trip count loops sometimes run faster unpipelined. Unrolling also increases the code size, which may adversely affect instruction cache performance. Unrolling is most beneficial for small loops because the potential performance degradation due to underutilized resources is greater and the effect of unrolling on instruction cache performance is smaller than for large loops.