10.2.4.5 Overlapping Resource Usage

Before performing if-conversion, the programmer must consider the execution resources consumed by predicated blocks in addition to considering flow-dependency height. The resource availability height of a set of instructions is the minimum number of cycles taken considering only the execution resources required to execute them.

The code below is derived from an if-then-else statement. Assume the generic machine model, which has only two load/store (M) units. If a compiler predicates and combines these two blocks, the resource availability height through the combined block is four clocks, since that is the minimum amount of time necessary to issue eight memory operations on two M units:

then_clause:
    ld r1=[r21]        // Cycle 0
    ld r2=[r22]        // Cycle 0
    st [r32]=r3        // Cycle 1
    st [r33]=r4 ;;     // Cycle 1
    br end_if
else_clause:
    ld r3=[r23]        // Cycle 0
    ld r4=[r24]        // Cycle 0
    st [r34]=r5        // Cycle 1
    st [r35]=r6 ;;     // Cycle 1
end_if:
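As a rough cross-check of the four-clock figure, the resource availability height for a single resource class can be sketched as a ceiling division of the operation count by the number of units. The function name `resource_height` is hypothetical, and the model assumes that every operation occupies one unit for exactly one cycle and that `units` is greater than zero; it is a simplification, not a description of any particular implementation.

```c
/* Hypothetical helper: minimum number of cycles needed to issue `ops`
 * operations when at most `units` of them can issue per cycle.
 * Assumes one resource class, one cycle per operation, and units > 0. */
unsigned resource_height(unsigned ops, unsigned units)
{
    return (ops + units - 1) / units;   /* ceiling division */
}
```

Under these assumptions, each clause alone (four memory operations on two M units) has a height of `resource_height(4, 2)`, and the combined predicated block has a height of `resource_height(8, 2)`.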

As with the example in the previous section, assuming various misprediction rates and taken branch penalties changes the decision as to when to predicate and when not to predicate. One case is illustrated below.

10.2.4.6 Case 1

Suppose the branch condition mispredicts 10% of the time and that the predicated code takes four clocks to execute. The average number of clocks for:

Non-predicated code is: (10 cycles * 10%) + 2 cycles = 3 cycles

Predicated code is: 4 cycles

Predicating this code would increase execution time even though the flow dependency heights of the branch paths are equal.
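The arithmetic in this case can be sketched as a small cost model. The function and parameter names below are hypothetical; the 2-cycle base cost and 10-cycle misprediction penalty are simply the values that appear in the formula above, and the model ignores taken-branch penalties and any other non-local effects.

```c
/* Hypothetical cost model: average cycles for the branching
 * (non-predicated) version. base_cycles is the cost of one path;
 * penalty_cycles is paid on the mispredict_rate fraction of executions. */
double branch_cost(double base_cycles, double penalty_cycles,
                   double mispredict_rate)
{
    return base_cycles + penalty_cycles * mispredict_rate;
}

/* Returns 1 when the predicated version is no slower than the branching
 * version under this model, and 0 otherwise. */
int predication_pays_off(double predicated_cycles, double base_cycles,
                         double penalty_cycles, double mispredict_rate)
{
    return predicated_cycles <=
           branch_cost(base_cycles, penalty_cycles, mispredict_rate);
}
```

With the values from Case 1, `predication_pays_off(4.0, 2.0, 10.0, 0.10)` returns 0, matching the conclusion that predicating this code would increase execution time. Raising the misprediction rate in the same model moves the break-even point, which is why the decision depends on profile data.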

10.2.5 Guidelines for Removing Branches

The following if-conversion guidelines apply to cases where only local behavior of the code and its execution profile are known:

1. The flow dependency and resource availability heights of both paths must be considered when deciding whether to predicate or not.

2. If if-conversion increases the length of any control path through the original code sequence, careful analysis using profile or misprediction data must be performed to ensure that the execution time of the converted code is equal to or better than that of the unpredicated code.

3. If if-conversion removes a branch that is mispredicted a significant percentage of the time, the transformation frequently pays off even if the blocks are significantly unbalanced since mispredictions are very expensive.

4. If the flow-dependency heights of the paths being if-converted are nearly equal and there are sufficient resources to execute both streams simultaneously, if-conversion is often advantageous.

Although these guidelines are useful for optimizing segments of code, the behavior of some programs is limited by non-local effects such as overall branch behavior, sensitivity to code size, and the percentage of time spent servicing branch mispredictions. In these situations, the decision to use if-conversion or perform other speculative transformations becomes more involved.

10.3 Control Flow Optimizations

A common occurrence in programs is for several control flows to converge at one point or for multiple control flows to start from one point. In the first case, multiple flows of control are often computing the value of the same variable or register, and the join point represents the point at which the program needs to select the correct value before proceeding. In the second case, multiple flows may begin at a point where several independent paths are taken based on a set of conditions. In addition to these multiway joins and branches, the computation of complex compound conditions normally requires a tree-like computation to reduce several conditions into one. IA-64 provides special instructions that allow such conditions to be computed in fewer tree levels. A third control-flow-related optimization uses if-conversion to generate straight-line code sequences that can be fetched efficiently. The use and optimization of these cases is described in the remainder of this section.

10.3.1 Reducing Critical Path with Parallel Compares

The computation of the compound branch condition shown below requires several instructions on processors without special instructions:

    if ( rA || rB || rC || rD ) {
        /* If-block instructions */
    }
    /* after if-block */

The pseudo-code below shows one possible solution, which uses a sequence of branches:

    cmp.ne p1,p0 = rA,0
    cmp.ne p2,p0 = rB,0
    (p1)br.cond if_block
    (p2)br.cond if_block
    cmp.ne p3,p0 = rC,0
    cmp.ne p4,p0 = rD,0
    (p3)br.cond if_block
    (p4)br.cond if_block
    // after if-block

On many IA-64 implementations, this sequence is likely to require at least two cycles to execute if all the conditions are false, plus the possibility of more cycles due to one or more branch mispredictions. Another possible sequence computes an or-tree reduction:

    or r1 = rA,rB
    or r2 = rC,rD ;;
    or r3 = r1,r2 ;;
    cmp.ne p1,p2 = r3,0
    (p1)br if_block

This solution requires three cycles to compute the branch condition which can then be used to branch to the if-block.
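The three-cycle figure follows from the shape of the reduction: two levels of two-input `or` operations for four inputs, plus one cycle for the compare. The sketch below models that count and the reduction itself. All function names are hypothetical, and the cycle model assumes that each tree level and the final compare each take one cycle, which may not hold on every implementation.

```c
/* Number of levels in a balanced tree of two-input ORs that reduces
 * n inputs to one value (n >= 1). */
unsigned or_tree_levels(unsigned n)
{
    unsigned levels = 0;
    while (n > 1) {
        n = (n + 1) / 2;   /* each level roughly halves the live values */
        levels++;
    }
    return levels;
}

/* Assumed cycle model: one cycle per OR level plus one for the compare. */
unsigned or_tree_cycles(unsigned n)
{
    return or_tree_levels(n) + 1;
}

/* The four-input reduction from the text, written out in C. */
int or_tree_condition(long rA, long rB, long rC, long rD)
{
    long r1 = rA | rB;   /* first level  */
    long r2 = rC | rD;   /* first level  */
    long r3 = r1 | r2;   /* second level */
    return r3 != 0;      /* compare      */
}
```

Because the depth grows with the logarithm of the number of inputs, wider conditions cost more cycles under this model, which is the overhead that parallel compares are meant to reduce.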

Note: It is also possible to predicate the if-block using p1 to avoid branch mispredictions.

To reduce the cost of compound conditionals, IA-64 has special parallel compare instructions to optimize expressions that have and and or operations. These compare instructions are special in that multiple and/or compare instructions are allowed to target the same predicate within a single instruction group. This feature allows the possibility that a compound conditional can be resolved in a single cycle.

For this usage model to work properly, IA-64 requires the programmer to ensure that, during any given execution of the code, all instructions that target a given predicate register either:

Write the same value (0 or 1), or

Do not write the target register at all.

This usage model means that sometimes a parallel compare may not update the value of its target registers and thus, unlike normal compares, the predicates used in parallel compares must be initialized prior to the parallel compare. Please see Part I: IA-64 Application Architecture Guide for full information on the operation of parallel compares.

Initialization code must be placed in an instruction group prior to the parallel compare. However, since the initialization code has no dependencies on prior values, it can generally be scheduled without contributing to the critical path of the code.

The instructions below show how to generate code for the example above using parallel compares:

    cmp.ne p1,p0 = r0,r0 ;;     // initialize p1 to 0
    cmp.ne.or p1,p0 = rA,r0
    cmp.ne.or p1,p0 = rB,r0
    cmp.ne.or p1,p0 = rC,r0
    cmp.ne.or p1,p0 = rD,r0
    (p1)br.cond if_block
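The behavior of the or-type compares in this sequence can be modeled in C as updates that either set the predicate or leave it alone. This is an illustrative sketch only: `cmp_ne_or` and `any_nonzero` are hypothetical names, and the model executes the compares one after another, whereas the hardware may issue them in the same instruction group.

```c
/* Model of an or-type compare: sets *p to 1 when a != b,
 * otherwise leaves *p untouched. */
void cmp_ne_or(int *p, long a, long b)
{
    if (a != b)
        *p = 1;
}

/* Model of the whole sequence. The compares are applied in a different
 * order than in the listing to show that the order does not matter:
 * every compare that writes p1 writes the same value. */
int any_nonzero(long rA, long rB, long rC, long rD)
{
    int p1 = 0;              /* initialization: p1 starts at 0 */
    cmp_ne_or(&p1, rD, 0);
    cmp_ne_or(&p1, rB, 0);
    cmp_ne_or(&p1, rA, 0);
    cmp_ne_or(&p1, rC, 0);
    return p1;
}
```

The model also shows why the initialization is required: if every input is zero, no compare writes `p1`, so the result is whatever value the predicate held beforehand.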

It is also possible to use p1 to predicate the if-block in-line to avoid a possible misprediction. More complex conditional expressions can also be generated with parallel compares:

if ((rA < 0) && (rB == -15) && (rC > 0)) /* If-block instructions */

The assembly pseudo-code below shows a possible sequence for the C code above:

    cmp.eq p1,p0=r0,r0;;        // initialize p1 to 1
    cmp.ne.and p1,p0=rB,-15
    cmp.ge.and p1,p0=rA,r0
    cmp.le.and p1,p0=rC,r0
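The and-type usage model can be sketched in C as a predicate that starts at 1 and can only be cleared. In this sketch each update is written in terms of the C-level term it enforces, not in terms of a particular compare relation; how each term maps onto a specific compare instruction is defined by the architecture guide. The names `and_update` and `all_terms` are hypothetical.

```c
/* Model of an and-type update: clears *p when the term does not hold,
 * otherwise leaves *p untouched. */
void and_update(int *p, int term_holds)
{
    if (!term_holds)
        *p = 0;
}

/* Model of (rA < 0) && (rB == -15) && (rC > 0). */
int all_terms(long rA, long rB, long rC)
{
    int p1 = 1;                   /* initialization: p1 starts at 1 */
    and_update(&p1, rB == -15);
    and_update(&p1, rA < 0);
    and_update(&p1, rC > 0);
    return p1;
}
```

As with the or-type model, every update that writes the predicate writes the same value (here 0), so the three updates can be applied in any order.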

When used correctly, and and or compares write both target predicates with the same value or do not write the target predicates at all.

Another variation on parallel compare usage is where both the if and else parts of a complex conditional are needed:

    if ( rA == 0 || rB == 10 )
        r1 = r2 + r3;
    else
        r4 = r5 - r6;

Parallel compares have an andcm variant that computes both the predicate and its complement simultaneously:

    cmp.ne p1,p2 = r0,r0 ;;             // initialize p1,p2
    cmp.eq.or.andcm p1,p2 = rA,r0
    cmp.eq.or.andcm p1,p2 = rB,10 ;;
    (p1)add r1=r2,r3
    (p2)sub r4=r5,r6
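The way the predicate and its complement are produced together can be sketched in C. The struct and function names are hypothetical, and the sketch again applies the compares sequentially as a simplification. The initial values follow from the normal compare of r0 with itself: the not-equal relation is false, so p1 becomes 0 and p2 becomes 1.

```c
struct preds {
    int p1;   /* guards the if part   */
    int p2;   /* guards the else part */
};

/* Model of an or.andcm-type compare: when a == b, set p1 and clear p2;
 * otherwise leave both predicates untouched. */
void cmp_eq_or_andcm(struct preds *p, long a, long b)
{
    if (a == b) {
        p->p1 = 1;
        p->p2 = 0;
    }
}

/* Model of the predicate computation for (rA == 0 || rB == 10). */
struct preds select_path(long rA, long rB)
{
    struct preds p = { 0, 1 };     /* initialization: p1 = 0, p2 = 1 */
    cmp_eq_or_andcm(&p, rA, 0);
    cmp_eq_or_andcm(&p, rB, 10);
    return p;
}
```

In this model exactly one of the two predicates is 1 after the compares, so exactly one of the predicated `add` and `sub` instructions takes effect.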

Clearly, these instructions can be used in other combinations to create more complex conditions.

10.3.2 Reducing Critical Path with Multiway Branches

While there are no special instructions to support branches with multiple conditions and multiple targets, IA-64 has implicit support by allowing multiple consecutive B-slot instructions within an instruction group.

As an example, consider a basic block with four possible successors. The following IA-64 multi-target branch code uses a BBB bundle template and can branch to block B, block C, or block D, or fall through to block A:

    label_AA:
        ...                     // Instructions in block AA
        { .bbb
        (p1)br.cond label_B
        (p2)br.cond label_C
        (p3)br.cond label_D
        }
        // Fall through to A
    label_A:
        ...                     // Instructions in block A

The ordering of branches is important for program correctness unless all branches are mutually exclusive, in which case the compiler can choose any ordering desired.
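The role of branch ordering can be sketched in C as a priority selection, under the assumption that the first branch in the bundle whose predicate is true determines the target. The enum and function names are hypothetical and stand in for the labels in the example.

```c
enum target { BLOCK_A, BLOCK_B, BLOCK_C, BLOCK_D };

/* Model of the three-branch bundle: predicates are tested in slot order,
 * and the first true predicate selects the successor. */
enum target multiway_branch(int p1, int p2, int p3)
{
    if (p1) return BLOCK_B;
    if (p2) return BLOCK_C;
    if (p3) return BLOCK_D;
    return BLOCK_A;            /* no branch taken: fall through */
}
```

When more than one predicate is true, for example `multiway_branch(1, 1, 0)`, the result depends on the order of the tests, so reordering the branches would change the successor. When the predicates are mutually exclusive, at most one test can succeed and any ordering gives the same result.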

10.3.3

Selecting Multiple Values for One Variable or Register with