Bundling Constraints - Optimal Global Instruction Scheduling for the Itanium® Processor Archite

6.6 Bundling

6.6.1 Bundling Constraints

In the remainder of this section, we will deal with a possibility that we have optimistically ignored in the above description of bundling: it can happen—due to intra-group dependences—that no feasible partial bundle sequence exists for an instruction group at all, even though the latter is feasible as defined in Sec. 5.2.1 (i.e., a mapping of the instructions to the execution units exists).

Consider, for example, the following group:

A: chk.s r26, .rec_7 //M/I

B: st4 [r20]=r26 //M

C: chk.a.clr r14, .rec_8 //M

D: st4 [r14]=r18 //M

E: ld8.c.clr r8=[r34] //M

F: add r20=r8,r56 //A

The control speculation check A must be scheduled before the store B (which could otherwise trigger a NaT consumption fault due to r26). Furthermore, the data speculation checks C and E must appear after the stores B and D, respectively, but before the instructions D and F, which read r14 and r8, respectively. As a result of these intra-group dependences, the six instructions may only appear in exactly the given order in a bundle pair, but this is evidently impossible: A would have to occupy an I-type slot, so that according to (2.1.1) none of the four M-type instructions B-E could be placed in the first bundle, and the second bundle has room for only two of them.

Our definition of “feasible instruction groups” in Sec. 5.2.1 is optimistic in the sense that it regards only the numbers of instructions of different types in the group, but ignores bundling-related issues. We call groups that are feasible according to this definition, but cannot be bundled without split issue due to the structure of their intra-group dependences, structurally infeasible.

They constitute a rarely occurring, but intricate problem that arises from the separation of instruction scheduling and bundling in our approach. In the experiments of Chapter 7, only two such groups (similar in form to the above example) emerged in the optimal schedule computed for one of the input routines (qSort3).

It may be possible to remove structurally infeasible groups in the schedule afterwards by reordering instructions between different groups, but it cannot be relied on that the schedule always permits this postpass remedy. The theoretically ideal solution would be to integrate bundling into the global scheduling phase. In [CLF⁺03, CLJ⁺04] it is shown how this integration can be done for a heuristic scheduler that targets the first-generation Itanium. When scheduling instructions into an instruction group, it employs a finite state automaton to keep track of the execution units occupied by them and to ensure that a template assignment exists for it. Each state encodes the currently occupied execution units in the group and is associated with a list of all possible template assignments that comply with this occupation. The scheduling of an instruction into the group triggers a transition between states. Such a transition is only legal if at least one template assignment associated with the new state satisfies all intra-group dependences; otherwise the corresponding scheduling decision is rejected. In doing so, the feasibility of bundling is ensured for all instruction groups.

An ILP formulation that incorporates bundling, similar to the one developed for micro-scheduling in [Win01], would however require up to six new variables in place of each x variable just to model the placement of instructions into different slots. Thus the number of x variables would be multiplied, and so would the number of precedence constraints (5.3.12) required for the intra-group dependences. The experiments indicate that the interdependences between scheduling and bundling are too weak to justify this massive increase in complexity.

Instead, our solution is to prevent the formation of structurally infeasible groups in advance by means of separate bundling constraints. For this purpose, we collect for each basic block A and each cycle t ∈ G(A) all instructions that can potentially be scheduled there in a set P_At ⊆ V (typically, P_At := Θ^x⁻¹(A)). We define a relation ≺ ⊆ E_D on P_At such that m ≺ n holds for a pair m, n ∈ P_At if these two instructions can appear together in the same group, but n must then appear after m there. To check the former condition, we can employ the minimum distances that will be computed later in Sec. 7.1.2: both instructions can be scheduled into the same group only if d^m,n ≤ 0. The second condition applies if there exists an intra-group DDG edge (m, n) ∈ E_D (with w_mn = 0).
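
As a minimal sketch of how this ordering relation could be collected, the following Python fragment filters the DDG edges by the two conditions just stated. The function name, the dictionary-based encoding of edge weights and minimum distances, and the extra instruction G with a latency-1 edge are illustrative assumptions, not part of the original formulation.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def build_order_relation(
    candidates: Set[str],
    ddg_weights: Dict[Edge, int],
    min_dist: Dict[Edge, int],
) -> Set[Edge]:
    """Return all pairs (m, n) where n must follow m inside one group.

    candidates  -- instructions that may be scheduled at (block, cycle)
    ddg_weights -- hypothetical encoding: DDG edge -> latency w_mn
    min_dist    -- hypothetical encoding: DDG edge -> minimum distance
    """
    relation: Set[Edge] = set()
    for (m, n), weight in ddg_weights.items():
        if m not in candidates or n not in candidates:
            continue
        same_group_possible = min_dist[(m, n)] <= 0  # first condition
        intra_group_edge = weight == 0               # second condition
        if same_group_possible and intra_group_edge:
            relation.add((m, n))
    return relation


# Example group from above; all listed dependences are intra-group (weight 0).
# G and the edge (F, G) are made up to show an edge that is filtered out.
candidates = {"A", "B", "C", "D", "E", "F", "G"}
ddg_weights = {
    ("A", "B"): 0, ("B", "C"): 0, ("C", "D"): 0,
    ("D", "E"): 0, ("E", "F"): 0, ("F", "G"): 1,
}
min_dist = dict(ddg_weights)  # assumed equal to the latencies here

relation = build_order_relation(candidates, ddg_weights, min_dist)
print(sorted(relation))
```

Only the five weight-0 edges survive; the edge (F, G) is dropped because G cannot share a group with F.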

The goal is now to find subsets of P_At that constitute structurally infeasible instruction groups and to exclude them via additional inequalities. We aim at characterizing these groups as generally as possible. In the above example, the combination that already leads to structural infeasibility (following the above argumentation) is a control speculation check A and four dependent M-type instructions B-E, each of which must follow A in the group (F is irrelevant). The following bundling constraint then prevents the formation of this group at cycle t in block A:

x^A_tA + x^A_tB + x^A_tC + x^A_tD + x^A_tE ≤ 4

We use patterns as an intuitive, unified representation of this and further structural infeasibility conditions; some of them are shown in Fig. 6.15. The intended meaning of these patterns is that each potential instruction group that matches one of them (called a match) is structurally infeasible and should be excluded by means of a bundling constraint. A pattern is a graph with nodes v_1, ..., v_k such that each node v_i is assigned an instruction type, R(v_i) ∈ R, and a cardinality, |v_i| ∈ N+. In Fig. 6.15, only cardinalities greater than one are shown explicitly, in parentheses. An instruction group P ⊆ P_At matches this pattern if there exists a function Φ : P → {v_1, ..., v_k} that maps |v_i| instructions to each node v_i under the following conditions:

1. the instruction types must match the node types and

2. if two nodes v_i and v_j are joined by an edge, then for each pair of instructions m and n mapped to v_i and v_j, respectively, a path from m to n exists in the graph (P_At, ≺).

Thus each pattern node is assigned a set of one or more instructions of the same type, and the edges represent (chains of) intra-group dependences between the instructions in these sets. The above example group matches Fig. 6.15 (a) by assigning A to the upper node and B-E to the lower node.
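
The matching conditions can be illustrated for the two-node pattern (a) with a short Python sketch. It assumes that pattern (a) consists of one CHK.S node joined by an edge to one M-type node of cardinality four, as the example suggests; the type labels, function names and the chain-shaped relation are illustrative choices.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def reachable(relation: Set[Edge], start: str) -> Set[str]:
    """All instructions reachable from start via one or more relation edges."""
    successors: Dict[str, Set[str]] = {}
    for m, n in relation:
        successors.setdefault(m, set()).add(n)
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def matches_pattern_a(group: Set[str], types: Dict[str, str],
                      relation: Set[Edge], m_count: int = 4) -> bool:
    """Condition 1: types fit the nodes. Condition 2: paths head -> M nodes."""
    if len(group) != 1 + m_count:
        return False
    for head in group:
        if types[head] != "CHK.S":
            continue
        rest = group - {head}
        if all(types[n] == "M" for n in rest) and rest <= reachable(relation, head):
            return True
    return False


# Example group; the instruction F is an A-type add and cannot fill an M node.
types = {"A": "CHK.S", "B": "M", "C": "M", "D": "M", "E": "M", "F": "A"}
relation = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")}

group = {"A", "B", "C", "D", "E"}
is_match = matches_pattern_a(group, types, relation)
# A match yields one bundling constraint: sum of x over the group <= |group| - 1.
constraint = (sorted(group), len(group) - 1)
print(is_match, constraint)
```

Paths are searched in the whole relation rather than only inside the candidate group, following the wording of condition 2.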

Figure 6.15: Patterns of structurally infeasible instruction groups.

The first three infeasibility patterns in Fig. 6.15 exploit the fact that no M-type instruction can be scheduled after an instruction in an I-type slot in the first bundle, and that there is then no template for the second bundle that can host more than two M-type instructions ((a), (b)) or the combination MIM/MMM (c). Fig. 6.15 (d) accounts for the rule that the first I-slot instruction in a group always occupies I0. The second F-type instruction of the last pattern must be placed in the second bundle with template MFI, MFB or MMF, but then there is no slot for a further M-type instruction after it in the bundle (the M-type instruction could be a setf with a WAR (intra-group) dependence on the floating-point instruction).

The bundling constraints for a pattern have the following general form (with w := Σ_{i=1}^{k} |v_i|):

Σ_{n ∈ P} x^A_tn ≤ w − 1    ∀A ∈ B, ∀t ∈ G(A), ∀ matches P ⊆ P_At    (6.6.2)

The number of these constraints is polynomial since |P| = w must be less than or equal to six.

Thus the number of possible matches cannot grow larger than (|P_At| choose 6) ≤ |P_At|⁶. However, this number can still be considerable, so the following reduction is useful: if |v_i| = c(R(v_i)) holds for a node v_i, i.e., if its cardinality is equal to the maximal number of instructions of this type in any feasible instruction group, then we can assign not only |v_i| but an unlimited number of instructions to this node, which leads to fewer, larger matches. As an example, consider pattern (a) and the previous example with a fifth dependent instruction F that, like B-E, must follow A: It is not necessary to instantiate (6.6.2) for all five possible matches {A, B, C, D, E}, {A, B, C, D, F}, {A, B, C, E, F}, {A, B, D, E, F}, and {A, C, D, E, F} such that A matches the first node of the pattern and the other four instructions the second node; instead, one constraint for P = {A, B, C, D, E, F} is sufficient.

For nodes without this property, such as those in pattern (b), however, this simplification is not allowed, since the bundling constraints would then become too tight and could exclude feasible groups.
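
The reduction for pattern (a) can be checked by brute force on the six-instruction example. The sketch below assumes c(M) = 4 (inferred from the cardinality of the M node in pattern (a)) and keeps the right-hand side w − 1 = 4 for the merged constraint; the encoding of a constraint as a pair (instruction set, right-hand side) is a hypothetical choice made for illustration.

```python
from itertools import combinations

HEAD = "A"
M_INSTRS = ["B", "C", "D", "E", "F"]  # five M-type instructions following A
MAX_M_PER_GROUP = 4                   # c(M), assumed from pattern (a)

# One constraint per match of pattern (a): sum of x over the match <= 4.
individual = [({HEAD, *four}, 4)
              for four in combinations(M_INSTRS, MAX_M_PER_GROUP)]
# The single merged constraint over all six instructions, same right-hand side.
merged = ({HEAD, *M_INSTRS}, 4)


def violates(group, constraint):
    """True if scheduling exactly `group` in the cycle breaks the constraint."""
    members, rhs = constraint
    return len(group & members) > rhs


def feasible_candidates():
    """All subsets that respect the per-type limit of four M instructions."""
    universe = [HEAD, *M_INSTRS]
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            group = set(combo)
            if len(group & set(M_INSTRS)) <= MAX_M_PER_GROUP:
                yield group


# The merged constraint excludes exactly the groups excluded by the five
# individual ones, as long as the type limit c(M) is enforced elsewhere.
same_effect = all(
    violates(g, merged) == any(violates(g, c) for c in individual)
    for g in feasible_candidates()
)
print(len(individual), same_effect)
```

The equivalence relies on the type limit: a group with five M-type instructions and without A would also violate the merged constraint, but such a group is already infeasible.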

To generate all bundling constraints of a pattern, it is necessary to enumerate all of its matches in the graph (P_At, ≺). For subgraph isomorphism in general, no better algorithm than an exhaustive enumeration and checking of all potential matches is known [Epp95] (O(|P_At|⁶) comparisons here). For patterns of a simple tree-like or even linear structure as those in Fig. 6.15,
