Architecture
3.5 Optimization of Memory References
3.5.2 Data Interference
Data references with low interference probabilities and high path probabilities can make the best use of data speculation. In the pseudo-code below, assume the probabilities that the stores to *p1 and *p2 conflict with var are independent.
*p1 = /* Prob interference = 0.30 */ . . .
*p2 = /* Prob interference = 0.40 */ . . .
= var /* Load to be advanced */
If the compiler advances the load from var above the stores to pointers p1 and p2, then:
Prob that stores to p1 or p2 interfere with var
= 1.0 - (Prob p1 will not interfere with var * Prob p2 will not interfere with var)
= 1.0 - (0.70 * 0.60)
= 0.58
Given the interference probabilities above, there is a 58% probability at least one of p1 and p2 will interfere with a load from var if it is advanced above both of them. A compiler can use traditional heuristics concerning data interference and interprocedural memory access information to estimate these probabilities.
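The calculation above generalizes to any number of intervening stores. The following Python sketch is illustrative only: the function name interference_probability is hypothetical, and it relies on the same independence assumption stated for the pseudo-code.

```python
from math import prod

def interference_probability(store_probs):
    """Probability that at least one store conflicts with the advanced load.

    Assumes the per-store conflict probabilities are independent,
    as in the example in the text.
    """
    return 1.0 - prod(1.0 - p for p in store_probs)

# Stores through p1 (0.30) and p2 (0.40), as in the pseudo-code
p = interference_probability([0.30, 0.40])
print(f"{p:.2f}")  # prints 0.58
```

A compiler heuristic could compare this value against a threshold before deciding how far to advance a load; the choice of threshold is not specified here.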
When advancing loads past function calls, the following should be considered:
• If a called function has many stores in it, it is more likely that actual or aliased ALAT conflicts will occur.
• If other advanced loads are executed during the function call, it is possible that their physical register numbers will either be identical or conflict with ALAT entries allocated from calls in parent functions.
• If it is unknown whether a large number of advanced loads will be executed by the called routines, then the possibility that the capacity of the ALAT may be exceeded must be considered.
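The capacity concern in the last bullet can be illustrated with a toy model. The sketch below is not a description of any real ALAT: the class name BoundedALAT, the capacity of four entries, and the oldest-entry replacement policy are all assumptions chosen for illustration, since actual capacity and replacement behavior are implementation-specific.

```python
from collections import OrderedDict

class BoundedALAT:
    """Toy ALAT with entries keyed by physical register number.

    Assumptions (not from the text): fixed capacity and
    oldest-entry replacement when the table is full.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # physical register -> load address

    def advanced_load(self, phys_reg, address):
        # A new advanced load to the same register replaces the old entry.
        self.entries.pop(phys_reg, None)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[phys_reg] = address

    def check(self, phys_reg):
        # True means the entry is still present, so the check would succeed.
        return phys_reg in self.entries

alat = BoundedALAT(capacity=4)
alat.advanced_load(phys_reg=40, address=0x1000)  # caller advances a load
for reg in range(50, 54):                        # callee issues four advanced loads
    alat.advanced_load(reg, 0x2000 + 8 * reg)
print(alat.check(40))  # prints False: the caller's entry was evicted
```

In this model the caller's check fails even though no store touched its address, which is why the number of advanced loads executed by called routines matters.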
3.5.3 Optimizing Code Size
Part of the decision of when to speculate should involve consideration of any possible increase in code size. Such consideration is not particular to speculation; it applies to any transformation that causes code to be duplicated, such as loop unrolling, procedure inlining, or tail duplication. Techniques to minimize code growth are discussed later in this section.
In general, control speculation increases the dynamic code size of a program since some of the speculated instructions are executed and their results are never used. Recovery code associated with control speculation primarily contributes to the static size of the binary since it is likely to be placed out-of-line and not brought into cache until a speculative computation fails (uncommon for control speculation).
Data speculation has a similar effect on code size, except that it is less likely to compute values that are never used, since most data speculative loads that are not also control speculative will have their results checked. Also, control speculative loads fail only in uncommon situations such as deferred data-related faults (depending on operating system configuration), while data speculative loads can fail due to ALAT conflicts, actual memory conflicts, or aliasing in the ALAT. The decision as to where to place recovery code for advanced loads is therefore more difficult than for control speculation and should be based on the expected conflict rate for each load.
As a general rule, efficient compilers will attempt to minimize code growth related to speculation. As an example, moving a load above the join of two paths may require duplication of speculative code on every path. The flow graph depicted in Figure 3-3 and the accompanying explanation show how this could arise.
If the compiler or programmer advanced the load up to block B from its original non-speculative position, all speculative code would need to be duplicated in both blocks B and C. This duplicated code might be able to occupy NOP slots that already exist. But if space for the code is not already available, it might be preferable to advance the load to block A since only one copy would be required in this case.
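The copy counts described above can be sketched with a small flow-graph model. The graph shape used here is an assumption: Block A branches to Blocks B and C, which join at a hypothetical Block D holding the original load. The function name hoist_one_level is likewise illustrative.

```python
# Assumed flow graph: A -> B, A -> C, B -> D, C -> D.
# Block D is a hypothetical name for the join block holding the original load.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def hoist_one_level(blocks, preds):
    """Blocks that need a copy of the speculative code after it is moved
    up out of `blocks` into all of their immediate predecessors."""
    return sorted({p for b in blocks for p in preds[b]})

level1 = hoist_one_level(["D"], preds)   # code must appear in B and C
level2 = hoist_one_level(level1, preds)  # code appears once, in A
print(level1, len(level1))  # prints ['B', 'C'] 2
print(level2, len(level2))  # prints ['A'] 1
```

Under these assumptions, stopping above the join costs two copies while continuing to Block A costs one, which matches the trade-off described in the text.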
3.5.4 Using Post-increment Loads and Stores
Post-increment loads and stores can improve performance by combining two operations in a single instruction. Although the text in this section mentions only post-increment loads, most of the information applies to stores as well.
Post-increment loads are issued on M-units and can increment their address register by either an immediate value or by the contents of a general register. The following pseudo-code, which performs two loads:
ld8 r2=[r1]
add r1=1,r1 ;;
ld8 r3=[r1]
can be rewritten using a post-increment load:
ld8 r2=[r1],1 ;;
ld8 r3=[r1]
Post-increment loads may not offer direct savings in dependency path height, but they are important when calculating addresses that feed subsequent loads:
• A post-increment load avoids code size expansion by combining two instructions into one.
• Adds can be issued on either I-units or M-units. When a program combines an add with a load, an I-unit or M-unit resource remains available that otherwise would have been consumed. Thus, throughput of dependent adds and loads can be doubled by using post-increment loads.
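The equivalence of the two load sequences shown earlier in this section can be checked with a toy interpreter. This is a minimal sketch: the instruction tuples, the run function, and the memory contents are hypothetical, and memory is modeled as a simple address-to-value map that ignores access width and alignment.

```python
def run(program, regs, mem):
    """Execute a list of toy instructions and return the final registers."""
    regs = dict(regs)  # work on a copy so each run starts from the same state
    for op, *args in program:
        if op == "ld":            # ("ld", dst, base): dst = mem[base]
            dst, base = args
            regs[dst] = mem[regs[base]]
        elif op == "ld_postinc":  # ("ld_postinc", dst, base, imm)
            dst, base, imm = args
            regs[dst] = mem[regs[base]]
            regs[base] += imm     # address register updated by the same instruction
        elif op == "add":         # ("add", dst, imm, src): dst = imm + src
            dst, imm, src = args
            regs[dst] = imm + regs[src]
    return regs

mem = {100: 7, 101: 9}            # arbitrary illustrative memory contents
start = {"r1": 100}

separate = [("ld", "r2", "r1"), ("add", "r1", 1, "r1"), ("ld", "r3", "r1")]
fused = [("ld_postinc", "r2", "r1", 1), ("ld", "r3", "r1")]

print(run(separate, start, mem) == run(fused, start, mem))  # prints True
print(len(separate), len(fused))                            # prints 3 2
```

The fused sequence reaches the same register state with one fewer instruction, which is the code size saving described in the first bullet.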
Figure 3-3. Minimizing Code Size During Speculation
[Flow graph with Block A, Block B, and Block C; node labels: st, ld]
A disadvantage of post-increment loads is that they create new dependencies between post-increment loads and the operations that use the post-increment values. In some cases, the compiler may wish to separate post-increment loads into their component instructions to improve the overall schedule. Alternatively, the compiler could wait until after instruction scheduling and then opportunistically find places where post-increment loads could be substituted for separate load and add instructions.
3.5.5 Loop Optimization
In cyclic code, speculation can extend the use of classical loop optimizations like invariant code motion. Examine this pseudo-code:
while (cond) {
    c = a + b;   // Probably loop invariant
    *ptr++ = c;  // May point to a or b
}
The variables a and b are probably loop invariant; however, the compiler must assume the stores to *ptr will overwrite the values of a and b unless analysis can guarantee that this can never happen. The use of advanced loads and checks allows code that is likely to be invariant to be removed from a loop, even when a pointer cannot be disambiguated:
ld4.a r1 = [&a]
ld4.a r2 = [&b]
add r3 = r1,r2 // Move computation out of loop
while (cond) {
    chk.a.nc r1, recover1
L1: chk.a.nc r2, recover2
L2: *ptr++ = r3
}
At the end of the module:
recover1: // Recover from failed load of a
ld4.a r1 = [&a]
add r3 = r1, r2
br.sptk L1 // Unconditional branch
recover2: // Recover from failed load of b
ld4.a r2 = [&b]
add r3 = r1, r2
br.sptk L2 // Unconditional branch
Using speculation in this loop hides the latency of the calculation of c whenever the speculated code is successful.
Since checks have both a clear (clr) and a no clear (nc) form, the programmer must decide which to use. This example shows that when advanced loads are moved out of a loop while their checks remain inside it, the no clear version should be used. This is because the clear (clr) version removes the corresponding ALAT entry, which would cause the next check of that register to fail.
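The effect of the two check forms inside a loop can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the ToyALAT class, its methods, and the addresses are hypothetical, the table has unlimited capacity, and a store removes only entries whose address matches exactly.

```python
class ToyALAT:
    """Toy ALAT: maps a register name to the address of its advanced load."""
    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr):   # models an advanced load (ld.a)
        self.entries[reg] = addr

    def store(self, addr):                # a store removes matching entries
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg, clear):          # models chk.a.clr / chk.a.nc
        hit = reg in self.entries
        if hit and clear:
            del self.entries[reg]         # the clear form drops the entry on success
        return hit

def count_recoveries(iterations, clear):
    """Count how often recovery code runs when no store ever aliases the load."""
    alat = ToyALAT()
    alat.advanced_load("r1", 0x100)       # hoisted advanced load; address is arbitrary
    recoveries = 0
    for i in range(iterations):
        if not alat.check("r1", clear):
            recoveries += 1
            alat.advanced_load("r1", 0x100)  # recovery code reissues the load
        alat.store(0x200 + 4 * i)         # store that never touches 0x100
    return recoveries

print(count_recoveries(4, clear=False))  # prints 0
print(count_recoveries(4, clear=True))   # prints 2
```

In this model the clear form triggers recovery on every second iteration even though no store conflicts with the load, while the no clear form never enters recovery.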