Non-Exclusive Use Forking - Refinement and Summary of the Model


6.2.2 Non-Exclusive Use Forking

As remarked in the previous section, the speculation scheme presented there can be applied profitably only if the candidate has an exclusive use. This is because all non-exclusive uses must respect the dependence on the new non-speculative move to ensure that they read the result of the candidate only if the speculation succeeds (and otherwise that of another, concurrent definition). Fig. 6.4 shows such a case, where speculating the concurrent definition “add r8=-16,r10” (in (b)) would not allow the non-exclusive use “sub r11=r11,r8” to be scheduled earlier than before (in (a)).

However, we can schedule the use earlier if we split it into two versions: one which reads the result of the speculative version of the definition, “add r8b=-16,r10”, and another which reads r8 as written by the other concurrent definition, “ld4 r8=[r33]”. As depicted in Fig. 6.4 (c), these two speculative versions of the non-exclusive use write their results to two new temporary registers, r11a and r11b, respectively. Once the predicates are available in cycle three, the correct result is written back to r11.

We can go even further and fork not only non-exclusive uses, but also other instructions that depend on them. In doing so, we effectively execute two versions of the same sequence of instructions in parallel, one for each of the two possible outcomes of a (yet to be made) control flow decision (which is represented by the predicates in the example). They are executed concurrently and once the decision is resolved, the results of the correct sequence are moved back to the original registers and those of the other one are discarded.

In the example, we have forked not only the use but also the dependent store. However, this instruction is special in that it is non-speculative. Such instructions can also be forked, but not speculatively; instead, the two versions must be guarded by the predicates that represent the disjoint control flow paths. In this regard, we distinguish between speculative and predicated forking. Predicated forking can save one cycle if applied to the last instruction in a forked chain, since it avoids the data dependence(s) on the move(s). In Fig. 6.4 (c), for instance, without forking, the store “st4 [r33]=r11” would have to be scheduled in cycle four in order to respect the RAW dependences on the last two moves in cycle three.
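The effect of forking can be illustrated with a small Python sketch that is not part of the scheduler itself: it models the instructions of the example as plain assignments and checks that the forked variant produces the same r11 as the original sequence. The predicate name `p_add`, the assumption that the load stays guarded by the complementary predicate, and the register and memory contents are hypothetical and chosen only for illustration.

```python
def original(regs, mem, p_add):
    """Unforked sequence: one of the two definitions of r8 executes, then the use."""
    r = dict(regs)
    if p_add:
        r["r8"] = -16 + r["r10"]      # (p_add)  add r8=-16,r10
    else:
        r["r8"] = mem[r["r33"]]       # (!p_add) ld4 r8=[r33]
    r["r11"] = r["r11"] - r["r8"]     # sub r11=r11,r8
    return r["r11"]


def forked(regs, mem, p_add):
    """Forked sequence: both versions of the use are computed before selection."""
    r = dict(regs)
    r8b = -16 + r["r10"]              # speculative definition: add r8b=-16,r10
    if not p_add:                     # load assumed to remain predicated
        r["r8"] = mem[r["r33"]]
    r11a = r["r11"] - r8b             # forked use reading the speculative result
    r11b = r["r11"] - r["r8"]         # forked use reading the other definition
    # Predicated moves: only the version on the taken path is written back.
    return r11a if p_add else r11b


if __name__ == "__main__":
    regs = {"r8": 7, "r10": 100, "r11": 50, "r33": 0x40}
    mem = {0x40: 5}
    for p in (True, False):
        print(p, original(regs, mem, p), forked(regs, mem, p))
```

The version on the path that is not taken may compute a meaningless value (here r11b when `p_add` holds); it is simply discarded, which mirrors the behaviour described for the forked chain.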

Figure 6.4: Forking and speculating a non-exclusive use.

The critical path length reductions achievable through forking can be drastic, as demonstrated by the 40% improvement in the example. However, the example also shows that forking can triple the instruction count: it duplicates each forked instruction and adds a move for each. This move can be omitted, though, if the register has no other use that is not forked, so that the eventual increase lies somewhere between a doubling and a tripling. Trading an increased number of parallel, speculative instructions for a shorter critical path in this way complies with the basic principles of the architecture.
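The bounds on the instruction count can be made concrete with a short calculation. The following sketch only restates the counting argument of the paragraph above; the function name and its parameters are hypothetical.

```python
def forked_instruction_count(n_forked, n_moves_needed):
    """Instructions that replace `n_forked` original instructions after forking.

    Each forked instruction is duplicated (2 * n_forked); a move back to the
    original register is added only where another, non-forked use of that
    register exists (n_moves_needed, between 0 and n_forked).
    """
    if not 0 <= n_moves_needed <= n_forked:
        raise ValueError("n_moves_needed must lie between 0 and n_forked")
    return 2 * n_forked + n_moves_needed


if __name__ == "__main__":
    n = 2
    for moves in range(n + 1):
        total = forked_instruction_count(n, moves)
        print(f"{moves} moves: {total} instructions, factor {total / n}")
```

For any chain length the growth factor therefore lies between 2 (no moves needed) and 3 (one move per forked instruction).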

Non-exclusive use forking can be incorporated into the model very similarly to the previous speculation decisions using mutually exclusive sets of instructions. We do not expand on the details here, but we make use of a tentative implementation in the experiments (see Sec. 7.2).

The proposed transformation is similar to multipath execution, a hardware technique that forks on hard-to-predict branches and executes both resulting control flow paths in parallel until the branch is resolved [ASMC98]. To some extent, speculative upward code motion already implements such a control speculative execution of instructions from two different paths, but the presented transformation goes further: it also duplicates instructions from the point where the two paths join again and speculates them, too. It is an example of how aggressive speculation in combination with predication can shrink the critical path considerably, which is a key requirement for more instruction-level parallelism.

6.2.3 Data Speculation

A necessary prerequisite for the use of data speculation is a measure of the likelihood of aliasing: ideally, we are given for each memory dependence edge $e \in E_D^{\mathrm{mem}}$ (or at least for the RAW edges) an aliasing probability $\kappa_e \in \,]0, 1]$. Dependences with $\kappa_e = 1$ are called must-dependences and the others may-dependences. Analogously, we use the terms must/may-definitions and must/may-uses in the context of such dependences. In the construction of the global data dependence graph in Def. 3.2.6, it is important that the exclusion criterion (the second point) applies only to must-definitions and must-uses.
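The distinction between must- and may-dependences amounts to a simple classification of the memory dependence edges by their aliasing probability. The following minimal sketch assumes a hypothetical `MemDep` record; the optional cutoff corresponds to the practical threshold of 0.1 that the text mentions in a footnote and is not a fixed constant of the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemDep:
    """A memory dependence edge e with aliasing probability kappa_e in ]0, 1]."""
    src: str
    dst: str
    kappa: float


def is_must_dependence(dep: MemDep, practical_cutoff: Optional[float] = None) -> bool:
    """Strictly, only kappa_e = 1 is a must-dependence.

    If a practical cutoff is given, every edge whose probability exceeds it is
    treated like a must-dependence as well, i.e. it is never speculated.
    """
    if not 0.0 < dep.kappa <= 1.0:
        raise ValueError("kappa must lie in ]0, 1]")
    if practical_cutoff is None:
        return dep.kappa == 1.0
    return dep.kappa > practical_cutoff


if __name__ == "__main__":
    edges = [MemDep("st1", "ld1", 1.0), MemDep("st2", "ld1", 0.3), MemDep("st3", "ld1", 0.05)]
    for e in edges:
        print(e, is_must_dependence(e), is_must_dependence(e, practical_cutoff=0.1))
```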

Table 6.1: Overview of the speculation-related instructions from Sec. 2.1.5. The notions from [Int02a] are given together with the mnemonics. Column groups: Control Speculation (Loads, Checks) and Data Speculation (Loads, Checks), each listing variant and mnemonic (Mnmc.).

(Note: the table does not imply a correspondence between loads and checks in the same line.)

We can employ data speculation separately or in combination with control speculation (using the advanced loads ld.a and ld.sa, respectively; a short overview of the notions is provided by Table 6.1). In both cases, only a check load ld.c or an advanced load check chk.a is required instead of the chk.s. The resulting main difference as regards the scheduler is that the former two instructions can only be executed on ML units, while the control speculation check can issue to MS or I. This difference is significant in practice since the two ML units can be considered scarce resources. Moreover, it forces us to consider data speculation checks as separate instructions: they cannot be merged with the control speculation check of the speculation scheme from the previous section, at least not without complicated changes to the resource constraints.

At least, we can regard the ld.c and the chk.a as one single instruction in the ILP. We refer to it as the “combined check” ld.c/chk.a in the following. After scheduling, this instruction is turned into an ld.c or into a chk.a, as described below. Figure 6.5 depicts how it is incorporated into the speculation scheme (initially, all dotted arrows can be ignored): All potentially aliasing stores⁷ are separated from the group “Other DDG predecessors” and form a group of their own. Let $ST = \{st_1, \dots, st_k\} \subseteq V$ denote the set of these stores. The speculative version is no longer dependent on them, but the combined check instruction is. All speculative and exclusive uses are dependent on the speculative version, but all non-speculative uses are dependent on the combined check with latency zero; this is the same as with the control speculation scheme of Fig. 6.3.

⁷ In principle, a store is “potentially aliasing” with a subsequent load if $\kappa_e < 1$ holds for the memory dependence edge $e$ between the two. In practice, however, memory dependences with $\kappa_e > 0.1$ can already be regarded as must-dependences since the benefit from ignoring them with data speculation will hardly outweigh the penalty cycles due to frequent failures. The choice of this threshold depends on the data speculation failure penalties of the target processor.

Figure 6.5: Data dependences in $\Delta^D_n$. (Nodes: the combined check “(p1) ld.c rY=[rZ]/chk.a rY”, the speculative version “ld.sa rY=[rZ]”, and the groups “Other DDG predecessors”, “Compares”, “Potentially aliasing stores”, “Speculative and exclusive uses”, and “Non-speculative uses”; two groups of edges are marked $(*_1)$ and $(*_2)$.)

If only data speculation is to be made available, this scheme can be used stand-alone: the “ld.sa rY=[rZ]” is replaced by an advanced load “ld.a rY=[rZ]”, which is a non-speculative instruction. Consequently, it must be guarded by the same predicate register as the candidate and made dependent on the same compares (differing from Fig. 6.5).

Alternatively, and as originally depicted by the figure, the possibility to use control and/or data speculation can be included in the ILP. Fig. 6.5 can then be considered as an addendum to the scheme of Fig. 6.3. Both are to be combined in such a way that “ld.s rY=[rZ]” and “ld.sa rY=[rZ]” are the same instruction. Altogether, three mutually exclusive (but partly overlapping) sets of instructions are distinguished:

• $\Delta_n$: The candidate.

• $\Delta^C_n$: The speculative version plus the chk.s and/or the mov.

• $\Delta^D_n$: The speculative version plus the combined check ld.c/chk.a and/or the mov.

(In the case of stand-alone control speculation, the third set does not exist; in the case of stand-alone data speculation, the second set does not exist.)

Two questions remain to be clarified regarding the integration with the scheduling ILP:

• How can the decision between control speculation alone and control speculation combined with data speculation be modeled?

• How can it be determined whether an advanced load check instruction chk.a is necessary in place of a simple check load ld.c, and is this distinction important?

The first question arises only if the ILP solver is to have the choice between control speculation alone and control speculation in combination with data speculation. To model this choice, we employ two new binary variables $S_n^C$ and $S_n^D$ that are intended to be equal to one exactly in the first and the second case, respectively, and add the equation:

$$S_n = S_n^C + S_n^D$$

Then we change the assignment constraints and the precedence constraints in such a way that the sets $\Delta^C_n$ and $\Delta^D_n$ appear in the schedule if and only if $S_n^C$ and $S_n^D$ are equal to one, respectively. This is done as described in the previous section. Finally, we include instances of the precedence constraints (5.3.12) for the may-dependences marked by $(*_1)$ in Fig. 6.5, but add $S_n^D$ to their right-hand sides so that these dependences can be ignored if and only if data speculation is being used ($S_n^D = 1$).
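The effect of the coupling equation can be checked by enumerating the binary variables. The sketch below is a plain-Python illustration without an ILP solver. Since the exact form of the precedence constraints (5.3.12) is not repeated here, the relaxed dependence is modeled, as an assumption, by an abstract row of the form "violation ≤ $S_n^D$", where the hypothetical indicator `violation` is 1 if the schedule places the load ahead of the potentially aliasing store.

```python
from itertools import product


def feasible_modes():
    """Enumerate all binary triples (S_n, S_n^C, S_n^D) with S_n = S_n^C + S_n^D."""
    return [(s, sc, sd)
            for s, sc, sd in product((0, 1), repeat=3)
            if s == sc + sd]


def may_dependence_satisfied(violation, s_d):
    """Abstract stand-in for a relaxed precedence row: violation <= S_n^D.

    The dependence may be violated (violation = 1) only if data speculation
    is used for the candidate (S_n^D = 1).
    """
    return violation <= s_d


if __name__ == "__main__":
    for s, sc, sd in feasible_modes():
        print(f"S_n={s} S_n^C={sc} S_n^D={sd} "
              f"load may pass store: {may_dependence_satisfied(1, sd)}")
```

Exactly three combinations remain: no speculation, control speculation alone, and control speculation combined with data speculation; only the last one allows the may-dependence to be ignored.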

Such constraints need not be added for a may-dependence on a store $st_i$ that is a DDG predecessor of another store $st_j$ whose source block postdominates that of $st_i$. In this case, a violated dependence on $st_i$ already implies a violated dependence on $st_j$, so that the precedence constraints related to $st_i$ are redundant. We denote by $ST' \subseteq ST$ the subset of those stores for which the constraints are not redundant.

Regarding the second question, recall from Sec. 2.1.5.2 that uses can only be speculated together with an advanced load if the advanced load check instruction chk.a is used, which branches to recovery code. However, the penalty of failed data speculation is then, at 20 cycles or more, significantly higher than the 8-cycle penalty if a check load is used (these penalties are denoted by $p^{DA}$ and $p^{DC}$, respectively, in the following).

There are three ways to deal with this difference. Firstly, if it can be assumed that a failure is extremely unlikely, then the distinction between the two kinds of checks is dispensable and can be ignored in the ILP. After the optimization, the combined checks in the schedule are then replaced by check loads or, where necessary, by advanced load checks with recovery code.

Secondly, if we have for an advanced load a general estimate of the failure probability that is not negligible (denoted by $\kappa$), then this can be taken into account in the objective function. For this, we introduce a further binary variable $S_n^{DA}$ that is intended to be equal to one if and only if data speculation is used with an advanced load check instead of a check load. If this variable has value one, then $S_n^D$ must also have value one (since data speculation is then being used), as ensured by the following constraint:

$$S_n^{DA} \le S_n^D$$

Furthermore, we include precedence constraints for the dependences marked by $(*_2)$ in Fig. 6.5, but this time we add $S_n^{DA}$ to their right-hand sides to model that these dependences can be ignored if and only if an advanced load check is being used ($S_n^{DA} = 1$). Then we can add the term

$$\left[p^{DC} f_{s(n)} \kappa\right] S_n^D + \left[(p^{DA} - p^{DC}) f_{s(n)} \kappa\right] S_n^{DA}$$

to the objective function to take the penalties into account. In this term, the penalty cycles are weighted by the failure probabilities and by the execution frequency of the check’s source block, $f_{s(n)}$; the terms in the square brackets are constants.
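The penalty term can be evaluated for the three admissible variable combinations to see what it charges in each case. The following sketch uses the 8-cycle check load penalty from the text and, as an assumption, the lower bound of 20 cycles for the advanced load check penalty; the function name and the sample frequency and probability are hypothetical.

```python
def penalty_term(s_d, s_da, kappa, freq, p_dc=8.0, p_da=20.0):
    """Evaluate [p_dc*f*kappa]*S_n^D + [(p_da - p_dc)*f*kappa]*S_n^DA.

    s_d, s_da : binary values of S_n^D and S_n^DA
    kappa     : general failure probability estimate for the advanced load
    freq      : execution frequency of the check's source block
    """
    if s_da > s_d:
        raise ValueError("constraint S_n^DA <= S_n^D is violated")
    return p_dc * freq * kappa * s_d + (p_da - p_dc) * freq * kappa * s_da


if __name__ == "__main__":
    for s_d, s_da in ((0, 0), (1, 0), (1, 1)):
        print(s_d, s_da, penalty_term(s_d, s_da, kappa=0.05, freq=100.0))
```

With an advanced load check ($S_n^D = S_n^{DA} = 1$) the two summands add up to $p^{DA} f_{s(n)} \kappa$, which is why the second coefficient contains only the difference of the two penalties.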

The third and most precise approach breaks the failure probability down into the components contributed by the different speculated may-dependences. For this purpose, we introduce for each store $st_i \in ST$ a new pair of mutually exclusive binary variables, $S^{DA}_{st_i,n}$ and $S^{DC}_{st_i,n}$, of which one is equal to one if and only if the advanced load is scheduled before this store, namely the former if $S_n^{DA} = 1$ and the latter if $S_n^{DA} = 0$. The reason why we need two additional variables is that the objective function must remain linear; otherwise we could have a single variable $S^{D}_{st_i,n}$ and use the product $S^{D}_{st_i,n} \cdot S_n^{DA}$ instead of $S^{DA}_{st_i,n}$, for example. The penalty weighted by the aggregate failure probability, divided into advanced load check and check load parts, is then given by the following sum, which is added to the objective function:

$$\sum_{i=1}^{k} \left( \left[p^{DA} f_{s(n)} \kappa_{(st_i,n)}\right] S^{DA}_{st_i,n} + \left[p^{DC} f_{s(n)} \kappa_{(st_i,n)}\right] S^{DC}_{st_i,n} \right)$$

As above, the term $S^{DA}_{st_i,n} + S^{DC}_{st_i,n}$ is added to the right-hand side of all instances of the precedence constraints (5.3.12) that are generated for a may-dependence $(*_1)$ on a store $st_i$ (instead of the variable $S_n^D$). The following constraints are necessary to connect the $S_n^D$ variable to the new variables. If one of the new variables has value one, then $S_n^D$ must also have value one (since data speculation is then being used):

$$S^{DA}_{st_i,n} + S^{DC}_{st_i,n} \le S_n^D \quad \forall\, st_i \in ST$$

Finally, we must add the following inequalities to enforce that the $S^{DC}_{st_i,n}$ variables can only be equal to one if no advanced load check is employed:

$$S^{DC}_{st_i,n} + S_n^{DA} \le 1 \quad \forall\, st_i \in ST$$
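The linking constraints and the per-store penalty sum of the third approach can be sketched as two small Python functions. This is only an illustration for a single candidate n, not the actual ILP generation; the list-based encoding of the per-store variables, the function names, and the default penalty of 20 cycles for the advanced load check (the lower bound given in the text) are assumptions.

```python
def linking_constraints_hold(s_d, s_da, s_st_da, s_st_dc):
    """Check the linking constraints for one candidate n.

    s_d, s_da : binary values of S_n^D and S_n^DA
    s_st_da   : list of binary values S^DA_{st_i,n}, one per store st_i
    s_st_dc   : list of binary values S^DC_{st_i,n}, one per store st_i
    """
    if s_da > s_d:                        # S_n^DA <= S_n^D
        return False
    for a, c in zip(s_st_da, s_st_dc):
        if a + c > s_d:                   # S^DA_{st_i,n} + S^DC_{st_i,n} <= S_n^D
            return False
        if c + s_da > 1:                  # S^DC_{st_i,n} + S_n^DA <= 1
            return False
    return True


def per_store_penalty(s_st_da, s_st_dc, kappas, freq, p_dc=8.0, p_da=20.0):
    """Sum over all stores of p_da*f*kappa_i*S^DA_i + p_dc*f*kappa_i*S^DC_i."""
    return sum(p_da * freq * k * a + p_dc * freq * k * c
               for a, c, k in zip(s_st_da, s_st_dc, kappas))


if __name__ == "__main__":
    kappas = [0.1, 0.2]
    # The advanced load passes only the first store and is checked by chk.a.
    print(linking_constraints_hold(1, 1, [1, 0], [0, 0]),
          per_store_penalty([1, 0], [0, 0], kappas, freq=10.0))
```

In contrast to the second approach, only the aliasing probabilities of those stores that the advanced load actually passes contribute to the penalty.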

After the ILP solver has returned a solution, minor postprocessing is necessary if a data speculation possibility was utilized in the schedule (visible from the value of $S_n^D$ in the solution):

Copies of the speculative version are turned into an “ld.sa” if they are scheduled speculatively and into an “ld.a” otherwise. The combined check is replaced by an advanced load check if a use is speculated in the schedule and by a check load otherwise. In the former case, recovery code is also added.
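The postprocessing rules can be summarized as a simple mapping from properties of the solution to mnemonics. The following sketch assumes that the two Boolean properties have already been extracted from the schedule; the function names are hypothetical.

```python
def finalize_speculative_load(scheduled_speculatively):
    """Mnemonic for a copy of the speculative version after scheduling."""
    return "ld.sa" if scheduled_speculatively else "ld.a"


def finalize_combined_check(use_speculated):
    """Resolve the combined check ld.c/chk.a.

    Returns the mnemonic and whether recovery code has to be generated.
    A chk.a with recovery code is needed only if a use of the advanced
    load was speculated; otherwise a check load suffices.
    """
    if use_speculated:
        return "chk.a", True
    return "ld.c", False


if __name__ == "__main__":
    print(finalize_speculative_load(True), finalize_combined_check(True))
    print(finalize_speculative_load(False), finalize_combined_check(False))
```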

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 167-172)