6.3 Partial-Ready Code Motion

6.3.1 Precise Formulation

Realizing this is more complex. A first step is to use separated local and global precedence constraints for all dependence edges that could be ignored by PR code motion:

$$a^{\uparrow A}_n \le a^{\uparrow A}_m \qquad \forall (m,n) \in E_D^{PR},\ \forall A \in \Theta^a(m) \cap \Theta^a(n) \tag{6.3.5}$$

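$$\sum_{\substack{t_n \in G(A) \\ t_n \le t + w_{mn} - 1}} x^{A t_n}_n \;+\; \sum_{\substack{t_m \in G(A) \\ t_m \ge t}} x^{A t_m}_m \;\le\; 1 \tag{6.3.6}$$

$$\forall (m,n) \in E_D^{PR},\ \forall A \in \Theta^x(m) \cap \Theta^x(n),\ \forall t \in G_{mn}(A)$$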

We have already discussed in Sec. 5.1.2 that the separated precedence constraints are equivalent to the general constraints with regard to the set of integer feasible solutions, but we have also noted there that they are not as tight as the latter. The rationale behind the separation is that we have defined PR code motion at basic block granularity and not at instruction granularity: a partial-ready copy is intended to ignore dependences on instructions that are scheduled in other successor blocks (along other control flow paths), but never on instructions scheduled in the same block. We schedule around blocks, but never around individual instructions⁹. In other words, dependences are ignored globally, but not locally. Thus the local precedence constraints (6.3.6) are never violated by a partial-ready copy, so that we can fully concentrate on the global variant (6.3.5), of which far fewer instances are generated.
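The effect of the local constraints (6.3.6) can be illustrated with a minimal Python sketch. It assumes that both m and n are placed in the same block, that the time slots G(A) are consecutive integers, and that G_mn(A) coincides with G(A); the function name and argument layout are hypothetical and not taken from the original implementation.

```python
def local_precedence_violated(t_m, t_n, w_mn, slots):
    """Check whether some instance of constraint (6.3.6) is violated.

    t_m, t_n : slots at which m and n are placed in the same block (assumption)
    w_mn     : latency of the dependence edge (m, n)
    slots    : the time slots of the block; used here for G_mn(A) as well (assumption)
    """
    for t in slots:
        # First sum: n is placed at some t_n <= t + w_mn - 1.
        n_term = 1 if t_n <= t + w_mn - 1 else 0
        # Second sum: m is placed at some t_m >= t.
        m_term = 1 if t_m >= t else 0
        if n_term + m_term > 1:
            return True
    return False
```

Under these assumptions, an instance is violated exactly when n starts fewer than w_mn slots after m, which is the dependence that a partial-ready copy must still respect inside a block.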

We adapt these constraints in such a way that each copy of an instruction $n$ respects the data dependence on $d_i \in PR^V(n)$ if and only if it is $S_i$-defined. For this purpose, we introduce additional, more differentiated $a$ variables that characterize $S_i$-defined copies:

$a^{S_i \uparrow A}_n = 1 \Rightarrow$ An $S_i$-defined copy of instruction $n$ is scheduled on each program path through $s(n)$ before $A$.

These new variables are added for each instruction $n \in V$ and each $S_i \in PR^B(n)$ for all blocks in $\Theta^a_{S_i}(n) := \Theta^a(n) \cap B^{\prec}(S_i)$. They represent additional information about the schedule that is relevant for the precise modeling of ignorable dependences in the presence of PR code motion. In the following, we demonstrate their incorporation into the model by way of example for an instruction $n$. To see how they should be "activated", imagine that we move, starting from $\Omega$, upwards through a given schedule in the opposite direction of a program path $C \in \mathcal{C}(s(n))$. Along this path the conventional $a_n$ variables are equal to one until the latest scheduled copy of $n$ is encountered or until a block $S_i \in PR^B(n)$ is reached for the first time. In the latter case, according to Def. 6.3.1, the next scheduled copy of $n$ along this path will be $S_i$-defined; thus for the subsequent blocks along the path the variable $a^{S_i \uparrow A}_n$ should be equal to one instead of $a^{\uparrow A}_n$. This is achieved by the following constraints, of which each instance is intended to replace the corresponding instance of constraint (6.3.1):

$$a^{\uparrow S_i}_n \le a^{S_i \uparrow A}_n + \sum_{t \in G(A)} x^{A t}_n \qquad \forall A, S_i \in \Theta^a(n) : (A, S_i) \in E_B \wedge S_i \in PR^B(n)$$

Once equal to one, the new $a_n$ variables are propagated in the same way as the conventional ones:

$$a^{S \uparrow B}_n \le a^{S \uparrow A}_n + \sum_{t \in G(A)} x^{A t}_n \qquad \forall A, B \in \Theta^a_S(n) : (A, B) \in E_B$$
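The activation and propagation constraints above can be checked on a given 0/1 assignment with a short Python sketch. It covers a single instruction n; the dictionary layout, the function name, and the convention that a missing key means "variable not instantiated for this block" are assumptions made for illustration only.

```python
def violated_instances(edges, pr_blocks, a_conv, a_def, copies):
    """List violated activation/propagation instances for one instruction n.

    edges     : block edges (A, B) in E_B, A being a predecessor of B
    pr_blocks : the blocks S_i in PR^B(n)
    a_conv    : {block: 0/1}, values of the conventional a variables
    a_def     : {(S, block): 0/1}, values of the S-defined a variables;
                a missing key means the variable is not instantiated (assumption)
    copies    : {block: number of copies of n in that block}, i.e. the sum over x
    """
    violations = []
    for A, B in edges:
        placed_in_A = copies.get(A, 0)
        # Activation: conventional variable at S_i = B versus S_i-defined variable at A.
        if B in pr_blocks and B in a_conv and (B, A) in a_def:
            if a_conv[B] > a_def[(B, A)] + placed_in_A:
                violations.append(("activate", B, A))
        # Propagation of each S-defined variable from B upwards to A.
        for S in pr_blocks:
            if (S, A) in a_def and (S, B) in a_def:
                if a_def[(S, B)] > a_def[(S, A)] + placed_in_A:
                    violations.append(("propagate", S, A, B))
    return violations
```

For a hypothetical chain of blocks R, P, S with S in PR^B(n), setting the S-defined variable to one at P is consistent only if a copy of n is placed in R or the variable is also one at R.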

The global precedence constraints (6.3.5) are not instantiated for edges $(m, n) \in E_D^{PR}$. Instead, the following variant is generated for the new $a_n$ variables in order to implement Def. 6.3.2-(2), namely that an ignorable dependence on a definition $d_i \in PR^V(n)$ is respected only if the copy of $n$ is $S_i$-defined:

$$a^{S \uparrow A}_n \le a^{\uparrow A}_m \tag{6.3.7}$$

$$\forall (m,n) \in E_D,\ \forall S \in PR^B(n) : S = s(m) \vee (m,n) \notin E_D^{PR},\ \forall A \in \Theta^a(m) \cap \Theta^a_S(n)$$
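The index set of (6.3.7) can be enumerated as in the following Python sketch. All container names and their layout are hypothetical; each returned tuple (S, A, m, n) stands for one instance stating that the S-defined variable of n at block A is bounded by the conventional variable of m at block A.

```python
def global_pr_precedence_instances(E_D, E_D_PR, PR_B, s, theta_a, theta_a_S):
    """Enumerate the instances of (6.3.7) as sorted tuples (S, A, m, n).

    E_D       : iterable of all dependence edges (m, n)
    E_D_PR    : set of the edges that PR code motion may ignore
    PR_B      : {n: blocks S_i in PR^B(n)}
    s         : {instruction: source block}
    theta_a   : {m: set of blocks with a conventional a variable}
    theta_a_S : {(n, S): set of blocks with an S-defined a variable}
    """
    instances = []
    for m, n in E_D:
        for S in PR_B.get(n, ()):
            # An ignorable edge is enforced only for the copy defined by s(m);
            # every other edge is enforced for all S-defined copies.
            if S == s[m] or (m, n) not in E_D_PR:
                common = theta_a[m] & theta_a_S.get((n, S), set())
                for A in common:
                    instances.append((S, A, m, n))
    return sorted(instances)
```

In this sketch an ignorable edge contributes one family of instances (for S = s(m)), whereas a non-ignorable edge contributes one family per block in PR^B(n).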

⁹ It would also be possible to permit a fine-grained variant of PR code motion at instruction level with multiple, differently defined copies of a use not only on a single control flow path, but even within a single basic block. We have not investigated this possibility further since it would entail a massive increase in complexity (multiple sorts of $x$ variables) that is out of proportion to the expected benefit.

In both sorts of global precedence constraints ((6.3.5) and (6.3.7)) we have to allow for the possibility that the instruction $m$ is also subject to PR code motion. Then its global placement is possibly described not only by the $a^{\uparrow A}_m$ variables, but also by the new $a^{S_i \uparrow A}_m$ variables.

Given an instruction $n$, several of these variables belonging to the same block $A$ can be equal to one: for example, if $\forall i \in J : a^{S_i \uparrow A}_n = 1$ for a subset $J \subseteq \{1, \dots, k\}$, then the copies of $n$ scheduled on all program paths through $s(n)$ before $A$ are $S_i$-defined for all $i \in J$. If such a copy is actually placed in a block A before A, then it holds $\forall i \in J : a^{S_i \uparrow A}_n = 0$ since Equ. (6.3.2) is also instantiated for the new $a_n$ variables. As a result, Corollary 6.3.5 still applies if "$a^{\uparrow B}_n = 1$"

is replaced by: and (6.3.7) in order to take partial-ready copies of $m$ into account.

Figure 6.9: A different application of PR code motion in the case study. The three definitions from blocks B, C, and F are abbreviated as k, l, and m, respectively.

If we assume in the case study that all definitions can be moved upwards through predicated code motion, then an application of PR code motion as in the schedule of Fig. 6.9 is possible and feasible in the developed precise formulation (but not in the lightweight version). The figure depicts the resulting variable values together with a subset of the instantiated a-x constraints (marked as AX) and global precedence constraints (marked as GB). It shows, for instance, how the C-defined, B-defined, and A-defined copy of the use in block A respects the dependences on both k and l, but ignores the dependence on m globally (since $a^{F \uparrow D}_n = 0$).

An implementation of the described modeling first identifies candidates for PR code motion. As mentioned, partial-ready execution occurs with speculative operands and is therefore only allowed for speculative instructions (including those added in Sec. 6.2). The set of candidates consists of all speculative instructions with a non-dominating data dependence on another instruction. If several such dependences exist with respect to different source registers, then the one on the closest definition is selected for PR code motion. An extension towards simultaneous PR code motion with respect to multiple source registers is straightforward, but will not be elaborated on here because the expected practical benefit is low.
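The candidate selection described above might be sketched in Python as follows. The dominance test and the notion of "closest" definition are passed in as callables because the text does not fix them here; these parameters, like all names in the sketch, are assumptions for illustration.

```python
def select_pr_candidates(deps, speculative, block_of, dominates, distance):
    """Pick at most one ignorable dependence per candidate use.

    deps        : iterable of data dependences (m, n), n using a value defined by m
    speculative : set of instructions that may execute speculatively
    block_of    : {instruction: source block}
    dominates   : callable(block_a, block_b) -> bool, supplied by the caller (assumption)
    distance    : callable(m, n) -> number, smaller means closer (assumption)
    Returns {n: m}, the selected definition m for each candidate n.
    """
    selected = {}
    for m, n in deps:
        if n not in speculative:
            continue  # partial-ready execution is only allowed for speculative instructions
        if dominates(block_of[m], block_of[n]):
            continue  # only non-dominating dependences qualify
        if n not in selected or distance(m, n) < distance(selected[n], n):
            selected[n] = m  # keep the dependence on the closest definition
    return selected
```

Restricting the result to one definition per use mirrors the choice made in the text; handling several source registers at once would require returning a set of definitions per use instead.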

When enabling PR code motion of these candidates in the model as described above, it is highly recommended to relax the a-x constraints not only for the candidates themselves, but also for dependent instructions that could be speculated together with a partial-ready copy. In the above presentation, we have concentrated only on the partial-ready copies themselves, but in fact the benefit from their early execution can be multiplied if data-dependent instructions are scheduled earlier together with them. In Fig. 6.7, for example, copies of an instruction that depends on Y=r1 could be scheduled after each of the partial-ready copies in the four blocks. Such indirectly partial-ready copies do not directly violate a data dependence on their own, but indirectly, since they are scheduled within the speculative scope of an instruction they depend on. Hence the relaxation of the a-x constraints (Def. 6.3.2-(1)) is already sufficient to make such copies feasible.
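One way to collect the instructions whose a-x constraints are relaxed is a traversal of the data dependence graph starting from the candidates. The following Python sketch assumes that only speculative dependents qualify and that the dependence successors are available as a dictionary; both the function name and these inputs are hypothetical.

```python
from collections import deque

def relaxation_set(candidates, succs, speculative):
    """Collect the candidates plus all transitively dependent speculative instructions.

    candidates  : instructions selected for PR code motion
    succs       : {m: instructions that directly depend on m}
    speculative : set of instructions that may execute speculatively (assumption:
                  only these can follow a partial-ready copy)
    """
    relaxed = set(candidates)
    queue = deque(candidates)
    while queue:
        m = queue.popleft()
        for n in succs.get(m, ()):
            # A dependent is indirectly partial-ready only if it is reachable
            # through instructions that are themselves relaxed.
            if n in speculative and n not in relaxed:
                relaxed.add(n)
                queue.append(n)
    return relaxed
```

A non-speculative dependent stops the traversal along its chain, so instructions reachable only through it are not relaxed.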

Partial-ready code motion might seem to be a relatively simple variant of code motion at first sight, but the required adaptations are far-reaching and complicated, even at a semi-formal level of presentation. As indicated, they also affect the efficiency of the model. Nevertheless, the complications are worthwhile, as the experiments show later: even the implemented lightweight variant proves to be highly valuable on some input programs, especially in combination with cyclic code motion, which is presented in the next section.

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 179-182)