10.3.3.1 Selecting One of Several Values

When several control paths that each compute a different value of a single variable meet, a sequence of conditionals is usually required to select which value will be used to update the variable. The use of predication can efficiently implement this code without branches:

switch (rW) {
    case 1: rA = rB + rC; break;
    case 2: rA = rE + rF; break;
    case 3: rA = rH - rI; break;
}

The entire switch-block above can be executed in a single cycle using predication if all of the predicates have been computed earlier. Assume that if rW equals 1, 2, or 3, then one of p1, p2, or p3 is true, respectively:

(p1) add rA=rB,rC
(p2) add rA=rE,rF
(p3) sub rA=rH,rI ;;

Without this predication capability, numerous branches or conditional move operations would be needed to collapse these values.

IA-64 allows multiple instructions to target the same register in the same clock provided that only one of the instructions writing the target register is predicated true in that clock. Similar capabilities exist for writing predicate registers, as discussed in Section 10.3.1.
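The effect of the three predicated writes can be modeled in C as a branch-free select. The sketch below is illustrative only: `select_one` and `pred_mask` are hypothetical names, the registers are modeled as 64-bit unsigned integers, and a C compiler is free to emit branches or conditional moves for this code rather than predicated instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: all-ones mask if the predicate is true, zero otherwise. */
static uint64_t pred_mask(int p)
{
    return (uint64_t)0 - (uint64_t)(p != 0);
}

/* Model of the predicated sequence: returns the new value of rA.
 * If rW is not 1, 2, or 3, no predicate is true and rA is unchanged. */
uint64_t select_one(int rW, uint64_t rA,
                    uint64_t rB, uint64_t rC,
                    uint64_t rE, uint64_t rF,
                    uint64_t rH, uint64_t rI)
{
    /* The predicates are assumed to have been computed earlier. */
    int p1 = (rW == 1);
    int p2 = (rW == 2);
    int p3 = (rW == 3);

    /* At most one writer of rA may be predicated true. */
    assert(p1 + p2 + p3 <= 1);

    uint64_t m1 = pred_mask(p1);
    uint64_t m2 = pred_mask(p2);
    uint64_t m3 = pred_mask(p3);

    /* All three results are computed; the masks keep only the enabled one. */
    return ((rB + rC) & m1)
         | ((rE + rF) & m2)
         | ((rH - rI) & m3)
         | (rA & ~(m1 | m2 | m3));
}
```

For example, `select_one(2, 99, 2, 3, 10, 20, 7, 4)` evaluates all three candidate results but returns only rE + rF.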

10.3.3.2 Reducing Register Usage

In some instances it is possible to use the same register for two separate computations in the presence of predication. This technique is similar to the technique for allowing multiple writers to store a value into the same register, although it is a register allocation optimization rather than a critical path issue.

After if-conversion, it is particularly common for sequences of instructions to be predicated with complementary predicates. The contrived sequence below shows instructions predicated by p1 and p2, which are known by the compiler to be complementary:

(p1) add r1=r2,r3
(p2) sub r5=r4,r56
(p1) ld8 r7=[r2]
(p2) ld8 r9=[r6] ;;
(p1) a use of r1
(p2) a use of r5
(p1) a use of r7
(p2) a use of r9

Assuming registers r1, r5, r7, and r9 are used for compiler temporaries, each of which is live only until its next use, the preceding code segment can be rewritten as:

(p1) add r1=r2,r3
(p2) sub r1=r4,r56      // Reuse r1
(p1) ld8 r7=[r2]
(p2) ld8 r7=[r6] ;;     // Reuse r7
(p1) a use of r1
(p2) a use of r1
(p1) a use of r7
(p2) a use of r7

The new sequence uses two fewer registers. With the 128 registers that IA-64 provides this may not seem essential, but reducing register use can still reduce program and register stack engine spills and fills that can be common in codes with high instruction-level parallelism.
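The equivalence of the two sequences can be illustrated with a small C model. This is a sketch under stated assumptions: `two_temps` and `one_temp` are hypothetical names, p2 is taken to be the complement of p1, only the add/sub pair is modeled (the loads are omitted), and the "use" of a temporary is modeled as returning its value.

```c
#include <stdint.h>

/* Before: each predicated path has its own temporary (r1 and r5). */
uint64_t two_temps(int p1, uint64_t r2, uint64_t r3, uint64_t r4, uint64_t r56)
{
    int p2 = !p1;               /* complementary predicates */
    uint64_t r1 = 0, r5 = 0;
    uint64_t result = 0;

    if (p1) r1 = r2 + r3;       /* (p1) add r1=r2,r3  */
    if (p2) r5 = r4 - r56;      /* (p2) sub r5=r4,r56 */
    if (p1) result = r1;        /* (p1) a use of r1   */
    if (p2) result = r5;        /* (p2) a use of r5   */
    return result;
}

/* After: both paths share r1, because exactly one of them is enabled. */
uint64_t one_temp(int p1, uint64_t r2, uint64_t r3, uint64_t r4, uint64_t r56)
{
    int p2 = !p1;
    uint64_t r1 = 0;
    uint64_t result = 0;

    if (p1) r1 = r2 + r3;       /* (p1) add r1=r2,r3            */
    if (p2) r1 = r4 - r56;      /* (p2) sub r1=r4,r56, reuse r1 */
    if (p1) result = r1;        /* (p1) a use of r1             */
    if (p2) result = r1;        /* (p2) a use of r1             */
    return result;
}
```

Both functions return the same value for either setting of p1, which is what makes the reuse safe when the temporaries are not live past their next use.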

10.3.4 Improving Instruction Stream Fetching

Instructions flow through the pipeline most efficiently when they are executed in large blocks with no taken branches. Whenever the instruction pointer needs to be changed, the hardware may have to insert bubbles into the pipeline either while the target prediction is taking place or because the target address is not computed until later in the pipeline.

By using predication to reduce the number of control flow changes, the fetching efficiency will generally improve. The only case where predication is likely to reduce instruction cache efficiency is when there is a large increase in the number of instructions fetched which are subsequently predicated off. Such a situation uses instruction cache space for instructions that compute no useful results.

10.3.4.1 Instruction Stream Alignment

For many processors, when a program branches to a new location, instruction fetching is performed on instruction cache lines. If the target of the branch does not start on a cache line boundary, then fetching from that target will likely not retrieve an entire cache line. This problem can be avoided if a programmer aligns instruction groups that cross more than one bundle so that the instruction groups do not span cache line boundaries. However, padding all labels would cause an unacceptable increase in code size. A more practical approach aligns only tops of loops and commonly entered basic blocks when the first instruction group extends across more than one bundle. That is, if both of the following conditions are true at some label L, then padding previous instruction groups so that L is aligned on a cache line boundary is recommended:

• The label is commonly branched to from out-of-line. Examples include tops of loops and commonly executed else clauses.

• The first instruction group at label L extends across more than one bundle.

To illustrate, assume code at label L in the segment below is not cache-aligned and that a cache boundary occurs between the two bundles. If a program were to branch to L, then execution may split issue after the third add instruction even though there are no resource oversubscriptions or stops:

L:
{ .mii
    add r1=r2,r3
    add r4=r5,r6
    add r7=r8,r9
}
{ .mfb
    ld8 r14=[r56] ;;
    nop.f
    nop.b
}

On the other hand, if L were aligned on an even-numbered bundle, then all four instructions at L could issue in one cycle.
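The alignment test described above can be sketched in C. The helper names `group_spans_line` and `padding_bundles` are hypothetical, addresses are assumed to be bundle-aligned, and the cache line size is passed as a parameter because it is implementation specific; the two-bundle example corresponds to a 32-byte line, which is an assumption made here for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BUNDLE_BYTES 16u   /* an IA-64 bundle is 128 bits */

/* Nonzero if an instruction group of `group_bundles` consecutive bundles
 * starting at `addr` crosses a cache line boundary.
 * `line_bytes` must be a power of two and at least one bundle. */
int group_spans_line(uint64_t addr, unsigned group_bundles, unsigned line_bytes)
{
    assert(group_bundles > 0);
    assert(line_bytes >= BUNDLE_BYTES && (line_bytes & (line_bytes - 1)) == 0);

    uint64_t first_line = addr / line_bytes;
    uint64_t last_byte  = addr + (uint64_t)group_bundles * BUNDLE_BYTES - 1;
    return first_line != last_byte / line_bytes;
}

/* Number of padding bundles to insert before `addr` so that the label
 * lands on the next cache line boundary; 0 if it is already aligned. */
unsigned padding_bundles(uint64_t addr, unsigned line_bytes)
{
    uint64_t offset = addr % line_bytes;
    return offset == 0 ? 0u : (unsigned)((line_bytes - offset) / BUNDLE_BYTES);
}
```

With a 32-byte line, a two-bundle group at address 0x1010 spans two lines and needs one bundle of padding, while the same group at 0x1000 fits in a single line.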

10.4 Branch and Prefetch Hints

Branch and prefetch hints are architecturally defined to allow the compiler or hand coder to provide extra information to the hardware. Compared to hardware, the compiler has more time, looks at a wider instruction window (including the source), and performs more analysis. Transfer of this knowledge to the processor can help to reduce penalties associated with I-cache accesses and branch prediction.

Two types of branch-related hints are defined by the IA-64 architecture: branch prediction hints and instruction prefetch hints. Branch prediction hints let the compiler recommend the resources (if any) that should be used to dynamically predict specific branches. With prefetch hints, the compiler can indicate the areas of the code that should be prefetched to reduce demand I-cache misses. Hints can be specified as completers on branch (br) instructions and on move to branch register instructions (abbreviated mov2br in this text, since the actual mnemonic is mov br=xx). The hints on branch instructions are the easiest to use, since the instruction already exists and only the hint completer has to be specified. mov2br instructions are used for indirect branches. The exact interpretation of these hints is implementation specific, although the general behavior of hints is expected to be similar between processor generations.

It is also possible to rewrite the hint fields on branches later using a binary rewriting tool. This can occur statically or at execution time based on profile data without changing the correctness of the program. This technique allows IA-64 static hints to be tailored for usage patterns that may not be fully known at compilation time or when the binaries are first distributed.
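A profile-driven choice of hint might look like the following C sketch. The completer names sptk, spnt, dptk, and dpnt are the IA-64 static/dynamic taken/not-taken branch hints; everything else is an assumption made for illustration: the function name `choose_hint`, the 95% bias threshold, and the policy of falling back to a dynamic hint when the profile is empty or weakly biased.

```c
#include <stdint.h>

/* IA-64 branch prediction hint completers. */
typedef enum {
    HINT_SPTK,  /* static, predict taken      */
    HINT_SPNT,  /* static, predict not taken  */
    HINT_DPTK,  /* dynamic, initially taken   */
    HINT_DPNT   /* dynamic, initially not taken */
} branch_hint;

/* Hypothetical policy: use a static hint when at least 95% of the
 * profiled executions went one way, otherwise request dynamic
 * prediction seeded with the majority direction.
 * Counts are assumed small enough that 19 * count does not overflow. */
branch_hint choose_hint(uint64_t taken, uint64_t not_taken)
{
    if (taken + not_taken == 0)
        return HINT_DPTK;               /* no profile data: stay dynamic */

    /* taken / (taken + not_taken) >= 0.95  <=>  taken >= 19 * not_taken */
    if (taken >= 19 * not_taken)
        return HINT_SPTK;
    if (not_taken >= 19 * taken)
        return HINT_SPNT;

    return taken >= not_taken ? HINT_DPTK : HINT_DPNT;
}
```

Because hints do not affect correctness, a rewriting tool applying a policy like this only changes performance, never program results.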

10.5 Summary

This chapter has presented a wide variety of topics related to optimizing control flow including predication, branch architecture, multiway branches, parallel compares, instruction stream alignment, and branch hints. Although such topics could have been presented in separate chapters, the interplay between the features is best understood by their effects on each other.

Predication and its interplay with scheduling region formation are central to IA-64 performance. Unfortunately, a discussion of compiler algorithms of this nature is far beyond the scope of this document.