Speculation - Fundamentals and Basic Notions


2.1.5 Speculation

In the computer architecture field, speculation refers to the early, tentative execution of an operation even if it is not yet known whether the result will be needed and correct at a later point in time.

If the assumptions the speculation was based upon turn out to be true, then the result is available earlier; otherwise, the speculation has failed and the result is discarded.

In the case of successful speculation, its benefit can be measured as the number of saved cycles, B, due to the earlier availability of the result. Two kinds of associated costs can be distinguished: a fixed part C_f, which is incurred regardless of success or failure, and a variable part C_v, which is incurred only if the speculation fails. The former can also include opportunity costs, i.e., the profit that would have been possible if the resources bound by the speculation had been used otherwise; it is often difficult to quantify this forgone speedup in cycles. The latter can be recovery costs resulting from the rollback of all effects of the speculative action. A speculation is useful on average if it has a positive expected benefit

p B − (1 − p) C_v − C_f > 0,

where p denotes the probability of successful speculation.
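As a rough worked example of this criterion, the following sketch evaluates the expected benefit and the break-even success probability. The function names and the cycle counts in the usage note are illustrative assumptions, not values from the text.

```python
def expected_benefit(p, benefit, cost_fail, cost_fixed):
    """Expected gain in cycles: p*B - (1 - p)*C_v - C_f."""
    return p * benefit - (1.0 - p) * cost_fail - cost_fixed


def break_even_probability(benefit, cost_fail, cost_fixed):
    """Success probability at which the expected benefit is exactly zero.

    Solving p*B - (1 - p)*C_v - C_f = 0 for p gives
    p = (C_v + C_f) / (B + C_v).
    """
    return (cost_fail + cost_fixed) / (benefit + cost_fail)
```

With the made-up values B = 6, C_v = 10 and C_f = 1, the break-even probability is 11/16 = 0.6875: a speculation that succeeds half of the time loses 3 cycles on average, while one that succeeds 90% of the time gains about 3.4 cycles.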

The following two subsections present the two major kinds of explicit speculation featured by the Itanium architecture. Both are directed by the compiler and supported by special hardware. They aim at executing loads earlier by moving them upwards across conditional branches (control speculation) and across potentially data-dependent stores (data speculation). This early issuing of loads is generally considered crucial for covering the memory latency and avoiding load-use stalls.

2.1.5.1 Control Speculation

Control speculation in general denotes the premature execution of an instruction even if it is not yet known whether the execution needs to take place. This can occur by moving code upwards from its original block (termed its source block) to a destination block that is not postdominated by the former (see later Def. 3.2.5). Then it is executed there earlier at runtime, before it is known that the control flow will reach the source block where its execution is actually needed. If the latter is the case, then the result is available earlier and the speculation was successful; otherwise the execution was superfluous and should have no harmful effect on the architectural state.

Most instructions can be executed speculatively (short: speculated), i.e., they have no harmful side effects if they are executed even though this would not have happened in the original program (see also the detailed discussion in Sec. 6.2). These instructions are often referred to as speculative; those without this property are called non-speculative. Memory accesses like loads are in general non-speculative since the address used during a superfluous execution could be invalid and trigger a false exception (which would not have occurred in the original program).

Often it can be proven through static analysis that a load is safe at a destination block, i.e., that it is never executed with an invalid address there [BRS92]. For the remaining cases, IA-64 supports the deferral of load exceptions: If the control speculative load instruction ld.s is used and if the conditions of an exception occur, then the exception is not triggered but instead a special NaT (Not a Thing) bit associated with the load destination register is set. This bit signals that the value of the register is void as the load failed due to a suppressed exception. Each of the 128 GPRs has such an additional NaT bit; for floating-point registers, the condition is represented by a special register value, NaTVal, which cannot occur as the result of normal computations.

If after the deferral of an exception the source block of the speculated load is not reached, then the set NaT bit has no effect at all—the load destination register is then not live, i.e., it is never read, but will be overwritten (together with the NaT bit) somewhere later.

In the opposite case, if the source block of the load or a control equivalent block is reached, then the exception has been real and must be dealt with. For this purpose a control speculation check instruction chk.s is scheduled there that branches to a specified label if its register argument is NaT or NaTVal. At this label the compiler has generated recovery code that reexecutes the load, this time in a normal, non-speculative version so that the exception is eventually triggered. After the exception has been handled, and if this has not terminated the program, the recovery code returns to the bundle after the chk.s and program execution resumes there.

Algorithm 4 Control speculation example.

Without Speculation     Cycle    With Control Speculation    Cycle
                                 ld8.s r3=[r2] ;;            -X-1
                                 add r4=8,r3 ;;              -1
(p1) br.cond label      0        (p1) br.cond label          0
ld8 r3=[r2] ;;          0        chk.s r3,recover            0
add r4=8,r3 ;;          X        back:
shladd r6=r5,2,r4       X+1      shladd r6=r5,2,r4           0

                                 recover:
                                 ld8 r3=[r2] ;;
                                 add r4=8,r3
                                 br back

It is also possible to speculate uses together with the load; these instructions then must also be replicated in the recovery code. Alg. 4 shows an example of a load with latency X that is speculated with its use add r4=8,r3: The latency of the load and the add can be completely hidden if they are both hoisted across the branch (under the assumption that their execution can be overlapped with other code before the branch—not shown in the example).

Note that, strictly speaking, the purpose of the recovery code here is not to recover from a failed control speculation, as the name suggests: The speculation fails in the example if the branch to label is taken since then the computation of r4 was unnecessary. The check and the recovery code only ensure proper exception handling in the case of successful speculation. This is different for data speculation introduced in the next section.
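The behaviour of the speculated column of Alg. 4 can be mimicked in a few lines of Python. This is only a behavioural sketch under simplifying assumptions: memory is a dictionary, a missing key stands in for an invalid address, and the names Fault, MEMORY, load, load_speculative and run are hypothetical helpers, not part of IA-64.

```python
class Fault(Exception):
    """Stands in for a memory exception such as an invalid address."""


MEMORY = {0x1000: 34}  # hypothetical valid memory: address -> value


def load(addr):
    """Non-speculative load (ld8): raises immediately on a bad address."""
    if addr not in MEMORY:
        raise Fault(hex(addr))
    return MEMORY[addr]


def load_speculative(addr):
    """Control-speculative load (ld8.s): returns (value, nat).

    On a bad address the exception is deferred: nat is True, value is void.
    """
    if addr not in MEMORY:
        return 0, True
    return MEMORY[addr], False


def run(r2, p1):
    """Speculated column of Alg. 4; returns r4, or None if the branch is taken."""
    r3, nat = load_speculative(r2)  # ld8.s r3=[r2]  (hoisted above the branch)
    r4 = 8 + r3                     # add r4=8,r3    (speculated use)
    if p1:                          # (p1) br.cond label
        return None                 # r3 and r4 are dead; the NaT is never seen
    if nat:                         # chk.s r3,recover
        r3 = load(r2)               # recover: ld8 r3=[r2] raises the real fault
        r4 = 8 + r3                 #          add r4=8,r3
    return r4
```

Calling run(0x1000, False) yields 42. With an invalid address, run(0xDEAD, True) returns None without any exception because the branch is taken, whereas run(0xDEAD, False) raises Fault from the recovery path, just as the original program would have.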

NaT bits propagate: All instructions set the NaT bits of all destination registers if at least one source register is NaT. Otherwise these NaT bits are always cleared by default (except for speculative loads, of course, which are the only NaT-producing instructions). The same rules apply to the NaTVals of the floating-point registers. NaTs are even propagated by transfer instructions that move data between the general purpose and floating-point registers.

The purpose of this propagation is as follows: If several loads are speculated together with a sequence of dependent instructions, it is sufficient to check the result(s) computed by this sequence to detect a deferred exception of any of the loads². The recovery code then repeats the whole speculative computation (with non-speculative loads). This can become a nontrivial problem if register values used in the computation are no longer available at the point of recovery.

To ensure recoverability, it can be necessary to enforce the availability or reconstructability of these values throughout speculative computations.
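The propagation rule can be illustrated with a small model in which every register carries a value and a NaT flag. The sketch below is an assumption-laden toy (the Reg type and the helpers ld_s, add and chase are invented for illustration); it shows that checking only the final result of a dependent chain detects a deferred exception in any of the speculated loads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A general register in this toy model: a value plus its NaT bit."""
    value: int = 0
    nat: bool = False


def ld_s(memory, addr):
    """Speculative load: produces NaT on a bad address or a NaT address."""
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value], False)


def add(a, b):
    """Ordinary instruction: the result is NaT if any source is NaT."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chase(memory, p):
    """Two dependent speculative loads followed by an add."""
    r1 = ld_s(memory, p)         # first load
    r2 = ld_s(memory, r1)        # second load uses the first result as address
    return add(r2, Reg(8))       # only this final result needs to be checked
```

For memory = {100: 200, 200: 7}, chase(memory, Reg(100)) gives Reg(15, False). If either load hits an unmapped address, the NaT flag of the final result is set, so a single check on that register suffices to trigger recovery of the whole chain.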

If a non-speculative load or store receives a NaT as address operand, the program terminates with a NaT consumption fault. The same happens on an attempt to store a NaT, unless the store has the completer spill: in that case the NaT bit is saved in a special register and can be restored by a load with the completer fill. These instructions are used during context switches.

2.1.5.2 Data Speculation

Data speculation denotes the hoisting of loads above potentially memory dependent (aliased, ambiguous) stores. In some cases, the compiler might be unable to prove statically that the accessed memory locations do not overlap. Then it is possible to speculate that no aliasing with the store occurs by executing the load as an advanced load ld.a before the store. Such an advanced load executes like a normal load but allocates in addition an entry with the register number and the memory address in a hardware structure called Advanced Load Address Table (ALAT). The presence of this entry in the ALAT signals that the corresponding memory location has been read by an advanced load and not been written afterwards. Consequently, any succeeding store invalidates (i.e., removes) all entries representing a memory location that overlaps with the one modified by the store.

After the store, a check load ld.c must be scheduled with the same operands as the advanced load in order to verify that no aliasing has occurred (otherwise the advanced load would have read incorrect data). For this purpose, it searches the ALAT for an entry with the same register number and type. If such an entry (still) exists, execution continues normally, otherwise the speculation has failed and the ld.c reissues the load.
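The interplay of advanced load, store and check load can be sketched with a toy ALAT. This is a simplified model under stated assumptions: entries are keyed by register number only (the register type is ignored), memory is a dictionary of aligned 8-byte words, and the class Alat and the function data_speculate are hypothetical names.

```python
class Alat:
    """Toy Advanced Load Address Table: register number -> (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size=8):
        """ld.a: record which location the register was loaded from."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size=8):
        """Any store: drop every entry whose byte range overlaps the store."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check(self, reg):
        """ld.c: True on a hit, i.e. the entry survived all later stores."""
        return reg in self.entries


def data_speculate(memory, r1, r2, r3):
    """Load hoisted above an ambiguous store; returns (r4, alat_hit)."""
    alat = Alat()
    r4 = memory[r3]              # ld8.a r4=[r3]
    alat.advanced_load(4, r3)
    memory[r1] = r2              # st8 [r1]=r2
    alat.store(r1)
    hit = alat.check(4)          # ld8.c r4=[r3]
    if not hit:
        r4 = memory[r3]          # miss: the check load reissues the load
    return r4, hit
```

If r1 and r3 differ, the entry survives and the early value is used; if they alias, the store removes the entry and the reissued load picks up the stored value.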

Alg. 5 shows an example where the use of an advanced load removes the load latency X from the critical path. Remarkably, the check load has a zero-cycle latency to consuming instructions and hence can be scheduled in the same instruction group before them (here the add). Only if it misses the ALAT does it incur a penalty, which may include a pipeline flush (numbers for the Itanium 2 are given in Sec. 2.2.4).

² For example, it would also be possible, but not advantageous, to check r4 in Alg. 4.

Algorithm 5 Breaking a memory dependence with an advanced load.

W/o Speculation         Cycle    With Data Speculation       Cycle
                                 ld8.a r4=[r3] ;;            -X or earlier
st8 [r1]=r2             0        st8 [r1]=r2                 0
ld8 r4=[r3] ;;          0        ld8.c r4=[r3]               0
add r4=8,r4 ;;          X        add r4=8,r4 ;;              0
shladd r6=r5,2,r4       X+1      shladd r6=r5,2,r4           1

Algorithm 6 Data speculation with recovery code. Speculating the use add r4=8,r4 with the load saves another cycle compared to Alg. 5.

W/o Speculation         Cycle    With Data Speculation       Cycle
                                 ld8.a r4=[r3] ;;            -X-1
                                 add r4=8,r4 ;;              -1
st8 [r1]=r2             0        st8 [r1]=r2                 0
ld8 r4=[r3] ;;          0        chk.a r4,recover            0
add r4=8,r4 ;;          X        back:
shladd r6=r5,2,r4       X+1      shladd r6=r5,2,r4           0

                                 recover:
                                 ld8 r4=[r3] ;;
                                 add r4=8,r4
                                 br back

If uses are speculated together with the load, the advanced load check instruction chk.a must be used in place of the check load. Instead of simply reissuing the load, the chk.a branches on an ALAT miss to recovery code that reexecutes the load and its uses (see Alg. 6). The whole procedure is very similar to control speculation. Control and data speculation can even be combined using a speculative advanced load ld.sa that performs all the operations of both an ld.s and an ld.a. An ALAT entry will not be allocated if this load defers an exception, so a chk.a is sufficient to check for both conditions.

A hardware implementation may realize the ALAT functionality incompletely for complexity or performance reasons. These limitations are always designed in such a way that the ALAT may only err on the right side, i.e., they may only cause unnecessary recoveries, but must not suppress a necessary recovery (they are detailed for the Itanium 2 in Sec. 2.2.4). Thus they can never harm the correctness, but only the performance. However, since the ALAT functionality is integrated with the critical L1 cache access path, a simplified ALAT may be crucial to enabling shorter cycle times and thus higher core frequencies.
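One way such a one-sided simplification could look is an ALAT that compares only a few low-order address bits. The sketch below is purely hypothetical (TAG_BITS, tag, exact_conflict and partial_conflict are invented names, and the scheme is not claimed to match any real implementation); it only demonstrates the safety property that a partial comparison can report spurious conflicts but never hides a real one.

```python
TAG_BITS = 12  # hypothetical number of low address bits kept per entry


def tag(addr):
    """Partial address tag: only the low TAG_BITS bits are retained."""
    return addr & ((1 << TAG_BITS) - 1)


def exact_conflict(load_addr, store_addr):
    """Ground truth for equally sized, aligned accesses: same address."""
    return load_addr == store_addr


def partial_conflict(load_addr, store_addr):
    """What the simplified table sees: equal partial tags."""
    return tag(load_addr) == tag(store_addr)
```

Addresses 0x1008 and 0x5008 do not alias, yet their 12-bit tags are equal, so the simplified table would invalidate the entry and force an unnecessary recovery. The reverse cannot happen: equal addresses always have equal tags, so a necessary recovery is never suppressed.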

Frequent ALAT misses, whether due to ALAT limitations or aliasing, can cause penalties that outweigh the benefit of data speculation. Thus the proper use of this feature in a static compiler relies heavily on static analyses: Firstly, the compiler should employ extensive alias analysis to hoist loads above stores even without data speculation [GLS01]. Secondly, a kind of probabilistic alias analysis is needed that provides estimates of the aliasing probability for may-aliases (which determines the factor p from Sec. 2.1.5) [JCO98, HCLJ01]. However, obtaining such estimates reliably through static analysis is a challenge; a critic puts it drastically [Hop00]: “If the compiler gets the probabilities wrong, the results will be terrible.”

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 45-50)