Global Instruction Scheduling Algorithms

3.3 Global Scheduling

3.3.1 Global Instruction Scheduling Algorithms

Global scheduling is often introduced on the basis of concrete scheduling methods like trace scheduling [Fis81], one of the first proposed algorithms. Trace scheduling identifies frequently executed paths in the basic block graph, called traces, and schedules these linear sequences of basic blocks as if they were a single basic block, using a local scheduling method like list scheduling [WM97]. During this local scheduling, instructions can be moved beyond the block boundaries of the underlying basic block sequence. When doing this, the compiler must avoid speculative code motion of non-speculative instructions and track where compensation copies have to be inserted (bookkeeping).

Figure 3.2: Illustration of a trace, a superblock after tail duplication, and a hyperblock.

These additional copies may increase the schedule lengths of other paths—in this way trace scheduling optimizes traces at the expense of off-trace paths, which may be disadvantageous for programs without distinct hot paths. The algorithm first schedules the most frequently executed trace and then selects, in decreasing order of path frequency, traces with unscheduled blocks, until all blocks are covered.
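The selection order described above can be illustrated with a minimal sketch. It assumes that profiling yields a frequency per block and a probability per branch edge; the names `select_traces`, `freq`, and `succ_prob` are hypothetical, and growing a trace along the most likely uncovered successor is a simplifying assumption rather than the exact procedure of [Fis81].

```python
def select_traces(freq, succ_prob):
    """Greedily cover all blocks with traces, hottest seed first.

    freq:      block name -> profiled execution frequency (assumed input)
    succ_prob: block name -> {successor name: branch probability}
    """
    unscheduled = set(freq)
    traces = []
    while unscheduled:
        # Seed the next trace with the hottest block that is not yet covered.
        block = max(sorted(unscheduled), key=lambda b: freq[b])
        trace = []
        while block is not None:
            trace.append(block)
            unscheduled.discard(block)
            # Extend along the most likely successor that is still uncovered.
            candidates = {s: p for s, p in succ_prob.get(block, {}).items()
                          if s in unscheduled}
            block = max(sorted(candidates), key=candidates.get) if candidates else None
        traces.append(trace)
    return traces


# Diamond-shaped region with an 80%/20% branch in block A.
freq = {"A": 100, "B": 80, "C": 20, "D": 100}
succ_prob = {"A": {"B": 0.8, "C": 0.2}, "B": {"D": 1.0}, "C": {"D": 1.0}}
traces = select_traces(freq, succ_prob)
```

In this example the hot path A, B, D forms the first trace, and the remaining block C ends up in a trace of its own.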

Another drawback of trace scheduling is the complexity of the bookkeeping related to joins in the trace (side entrances). Hwu et al. propose superblock scheduling to mitigate this complexity [HMC⁺93]: the selected traces are transformed via tail duplication (i.e., splitting the tail of a trace at joins into two separate copies) into traces with a single entry and multiple exits, called superblocks. Upward code motion inside a superblock then never requires compensation copies (the downward variant can possibly be left out without a significant performance loss). Superblock scheduling is more straightforward; however, tail duplication entails a code size increase.
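Tail duplication can be sketched as follows. This is a simplified illustration for an acyclic region, not the implementation of [HMC⁺93]; the function name `tail_duplicate`, the successor-map representation of the control flow graph, and the primed names for copies are assumptions made for the example.

```python
def tail_duplicate(trace, succs):
    """Remove side entrances from a trace by duplicating its tail.

    trace: list of block names in trace order
    succs: block name -> list of successor names (acyclic CFG, assumed)
    Returns a new successor map in which the copies carry a trailing "'".
    """
    index = {b: i for i, b in enumerate(trace)}

    def is_side_entrance(pred, block):
        # An edge into a non-head trace block is a side entrance unless it
        # comes from the block's immediate predecessor on the trace.
        i = index[block]
        return i > 0 and pred != trace[i - 1]

    entered = [index[s] for p, ss in succs.items() for s in ss
               if s in index and is_side_entrance(p, s)]
    if not entered:
        return {p: list(ss) for p, ss in succs.items()}  # already a superblock
    tail = set(trace[min(entered):])
    copy = {b: b + "'" for b in tail}

    new_succs = {}
    for p, ss in succs.items():
        # Redirect every side entrance into the tail to the duplicated block.
        new_succs[p] = [copy[s] if s in tail and is_side_entrance(p, s) else s
                        for s in ss]
    for b in tail:
        # A copy branches to copies of tail blocks, otherwise to the originals.
        new_succs[copy[b]] = [copy.get(s, s) for s in succs.get(b, [])]
    return new_succs


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
superblock_cfg = tail_duplicate(["A", "B", "D"], cfg)
```

For the trace A, B, D the off-trace block C now branches to the copy D', so the original D is reached only from B and the trace has a single entry.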

A superblock can be extended to include basic blocks from different control flow paths via predication: If it contains an outgoing control flow edge, then the instructions at the branch target can be merged into the superblock if they are guarded by the predicate that controls the branch (if-conversion)—the superblock is then called a hyperblock. It remains a linear code sequence since if-conversion turns all control dependences into data dependences. The hyperblock exhibits increased parallelism as it contains independent instructions from different paths. This makes the transformation particularly interesting for highly parallel architectures with predication, like EPIC architectures.

The technique that forms and schedules the hyperblocks is known as hyperblock scheduling [MLC⁺92]. Again, tail duplication is performed to keep the hyperblock free of side entrances. As in superblock scheduling, the resulting code size increase may be significant and must be kept under control by the compiler.

All methods mentioned so far have in common that they tackle global scheduling by reducing it to local scheduling. For this purpose, they select linear scheduling regions on the basis of profiling information. In contrast, the following four methods work directly on scheduling regions with arbitrary acyclic control flow. Percolation scheduling [Nic85] iteratively applies four semantics-preserving transformation rules to the control flow graph, three of which perform upward code motion (between adjacent blocks). This method originally assumed unbounded resources and unit latencies (i.e., latencies that are all equal to one) and was extended in later work [SS02].

Gupta and Soffa propose region scheduling [GS90] to increase and balance the parallelism in a program. The algorithm works on an extended program dependence graph (EPDG), a hierarchical representation of the scheduling region in which the nodes represent individual instructions, predicates, or regions of control equivalent instructions. Similar to percolation scheduling, the region scheduler repeatedly applies transformations to the EPDG that redistribute code among the regions until no further transformations are possible or the parallelism in each region matches that of the target processor.

The transformations are guided by estimates of the parallelism present in each region (the instruction count divided by the critical path length). They go beyond code motion and also include loop transformations (unrolling and invariant code motion) as well as region copying and collapsing (which essentially correspond to tail duplication and if-conversion, respectively). Code motion is allowed in both directions, also speculatively or with compensation code. It is applicable not only to the (leaf) nodes of the EPDG that represent instructions, but also to those higher in the hierarchy that represent regions. This means that the scheduler is capable of moving entire subgraphs that represent, for example, an if-statement. In doing so, region scheduling combines scheduling of fine-grain parallelism with transformations of the control structure that expose such parallelism.
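The parallelism estimate mentioned above (instruction count divided by critical path length) can be computed as in the following sketch. It assumes unit latencies, so the critical path length is simply the number of instructions on the longest dependence chain; the function name and the dependence-map representation are hypothetical.

```python
def estimated_parallelism(deps):
    """Instruction count divided by critical path length for one region.

    deps: instruction -> list of instructions it depends on (acyclic).
    Unit latencies are assumed for this illustration.
    """
    depth = {}

    def chain_length(n):
        # Length of the longest dependence chain that ends in n.
        if n not in depth:
            depth[n] = 1 + max((chain_length(p) for p in deps[n]), default=0)
        return depth[n]

    critical_path = max(chain_length(n) for n in deps)
    return len(deps) / critical_path


deps = {"i1": [], "i2": [], "i3": ["i1", "i2"], "i4": ["i3"],
        "i5": [], "i6": ["i5"]}
parallelism = estimated_parallelism(deps)  # 6 instructions, chain i1-i3-i4
```

Here six instructions with a critical path of three instructions give an estimate of 2.0, which a region scheduler would compare against the issue width of the target processor.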

Bernstein and Rodeh [BR91] present a comparatively simple global scheduling algorithm that processes the basic blocks inside an acyclic scheduling region in topological order (of the BBG) and schedules them one at a time. Each block is scheduled in a manner similar to list scheduling. In contrast to this local method, the list of data ready¹ candidates for scheduling is made up not only of instructions that originate from the block being scheduled, but also of instructions from control equivalent successor blocks and the direct successors of these blocks. In this way, (speculative) upward code motion is incorporated; compensation copies, however, are not supported by the presented implementation. Once all instructions that originate from the current block are scheduled, the scheduler moves to the next block. The approach has similarities with wavefront scheduling, the (patented) technique implemented in Intel’s Itanium compiler, which is outlined next.
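The construction of the candidate list can be sketched as follows. This is an illustration only, not the implementation of [BR91]: the control equivalence information is assumed to be given, "direct successors of these blocks" is read as the successors of the current block and of its control equivalent blocks, and all names are hypothetical.

```python
def candidate_blocks(block, equivalent, succs):
    """Blocks whose instructions may be scheduled into `block`.

    equivalent: block -> control equivalent successor blocks (assumed given)
    succs:      block -> direct successors in the BBG
    """
    useful = [block] + equivalent.get(block, [])
    # Direct successors of the useful blocks: motion from there is speculative.
    reachable = dict.fromkeys(t for b in useful for t in succs.get(b, []))
    speculative = [s for s in reachable if s not in useful]
    return useful, speculative


def ready_list(block, equivalent, succs, instrs, deps, scheduled):
    """Unscheduled, data ready candidates as (instruction, is_speculative)."""
    useful, speculative = candidate_blocks(block, equivalent, succs)
    ready = []
    for blocks, flag in ((useful, False), (speculative, True)):
        for b in blocks:
            for i in instrs.get(b, []):
                # Data ready: every instruction it depends on is scheduled.
                if i not in scheduled and all(d in scheduled for d in deps.get(i, [])):
                    ready.append((i, flag))
    return ready


succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
equivalent = {"A": ["D"]}          # D executes whenever A does
instrs = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"], "D": ["d1"]}
deps = {"a2": ["a1"], "b1": ["a1"], "d1": ["a2"]}
ready = ready_list("A", equivalent, succs, instrs, deps, scheduled={"a1"})
```

With a1 already scheduled, a2 is a non-speculative candidate, b1 and c1 are speculative candidates, and d1 is not yet data ready because it depends on a2.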

Wavefront scheduling [BMM00] uses a high level driver that selects scheduling regions and passes them to the scheduler proper. A scheduling region may contain arbitrary acyclic control flow. After it has been scheduled, it is grouped into one or more new BBG nodes and nested. This is a recursive process that starts at the innermost loops. When regions of completed blocks are nested, the data flow information within them (memory references, live-out and live-in values, outgoing latencies) is summarized in order to allow semantically correct code motion across the resulting nested nodes. In this way, code can even be moved across a loop that is abstracted away through nesting.

A specialty of wavefront scheduling is that it makes extensive use of path vectors to represent control-flow related information. Let $C = \{P_1, \dots, P_k\}$ be the set of complete paths through the scheduling region; then each subset of $C$ can be represented by a vector $x \in \{0, 1\}^k$, where $x_i$ is equal to one if and only if the subset contains $P_i$. For instance, $\mathit{BPV}(A) \in \{0, 1\}^k$ denotes, for each basic block $A$, the subset of paths that flow through it. $\mathit{Prob}(x)$ is the aggregate probability that the control flows along one of the paths embodied by the path vector $x$. Set operations like $\cap$ and $\setminus$ can be represented by simple boolean operations on the path vectors. The size of path vectors can grow exponentially with the region size; hence the regions are selected in such a way that the number of paths does not exceed a certain threshold.
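Path vectors map naturally onto machine integers. The following sketch assumes that the complete paths and their probabilities are already enumerated; the helper names `bpv` and `prob` mirror the notation of the text but are otherwise hypothetical, and the leftmost bit is chosen to stand for the first path.

```python
paths = [["A", "B", "D"], ["A", "C", "D"]]   # complete paths P1, P2
probs = [0.8, 0.2]                           # assumed profile probabilities
k = len(paths)


def bpv(block):
    """Block path vector: the leftmost bit stands for P1, the next for P2, ..."""
    return sum(1 << (k - 1 - i) for i, p in enumerate(paths) if block in p)


def prob(x):
    """Aggregate probability of the paths contained in path vector x."""
    return sum(pr for i, pr in enumerate(probs) if (x >> (k - 1 - i)) & 1)


through_b_and_d = bpv("B") & bpv("D")    # intersection: bitwise AND
through_a_not_b = bpv("A") & ~bpv("B")   # set difference: AND with complement
```

In this diamond-shaped region, the paths through A but not through B are exactly the paths through C, so the difference equals `bpv("C")`.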

We now describe how a region is scheduled. The scheduler processes the blocks in the region top-down in a certain topological order. This order is defined by the movement of a wavefront, a (changing) set of blocks such that each complete path in the region passes through exactly one block in it. The wavefront can be regarded as the boundary between scheduled and yet to be scheduled blocks in the region. More precisely,

• the blocks above the wavefront have been scheduled,

• the blocks on the wavefront are being scheduled (simultaneously), and

• the blocks below the wavefront still have to be scheduled.

The initial wavefront consists of all entry blocks. Once the scheduling of a block on the wavefront is finished, the scheduler attempts to advance the wavefront across it². Fig. 3.3 depicts how a wavefront can advance down the region until it passes through all the exit nodes (W1-W6).
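The defining property of a wavefront can be checked directly on the set of complete paths. The sketch below is not the advancement procedure of [BMM00]; it merely tests the invariant and assumes, for illustration, that advancing across a set of finished blocks replaces them by their successors. All names are hypothetical.

```python
def is_wavefront(paths, blocks):
    """True if every complete path passes through exactly one of `blocks`."""
    return all(sum(b in blocks for b in path) == 1 for path in paths)


def try_advance(paths, succs, wavefront, finished):
    """Try to move the wavefront across the blocks in `finished`.

    Returns the new wavefront, or None if the invariant would be violated.
    """
    successors = {s for b in finished for s in succs.get(b, [])}
    candidate = (wavefront - finished) | successors
    return candidate if is_wavefront(paths, candidate) else None


paths = [["A", "B", "D"], ["A", "C", "D"]]
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}

w1 = {"A"}                                         # all entry blocks
w2 = try_advance(paths, succs, w1, {"A"})          # {"B", "C"}
stuck = try_advance(paths, succs, w2, {"B"})       # None: C and D share a path
w3 = try_advance(paths, succs, w2, {"B", "C"})     # {"D"}
```

In this simplified model the wavefront cannot pass B alone, because D would then lie on the same path as C, which is still on the wavefront.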

¹ An instruction is data ready if it is not data dependent on another, yet unscheduled instruction.

² It is noteworthy that the algorithm requires all JS edges to be removed via JS blocks. Only then is it guaranteed that a block-wise advancement of the wavefront is always possible.

Figure 3.3: Wavefront advancement in six steps.

By scheduling instructions only into blocks on the wavefront, it is guaranteed that the compensation copies they require can be inserted entirely into other blocks on the wavefront. The scheduler tracks the compensation need of each instruction by means of a compensation path vector. This vector for an instruction $n$ with source block $s(n)$ is initialized to $\mathit{CPV}(n) := \mathit{BPV}(s(n))$. When scheduling $n$ into a block $D$, it is changed to $\mathit{CPV}(n) := \mathit{CPV}(n) \setminus \mathit{BPV}(D)$ to record where further copies still have to be scheduled. Fig. 3.4 shows an example where $\mathit{CPV}(n)$ is changed from 11100 to 01100 after $n := \mathit{op1}$ is scheduled into $D$.
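The update of the compensation path vector amounts to a single bit operation. The sketch below reproduces the change from 11100 to 01100; the value chosen for the block path vector of D is a hypothetical one that is consistent with this change, since the figure data is not available here.

```python
def schedule_into(cpv, bpv_target):
    """Remove the paths through the target block from the compensation vector."""
    return cpv & ~bpv_target


cpv = 0b11100      # initial value: the block path vector of the source block
bpv_d = 0b10000    # hypothetical: only the first path flows through D
cpv = schedule_into(cpv, bpv_d)

remaining = format(cpv, "05b")    # paths that still need a copy of n
needs_more_copies = cpv != 0      # the wavefront may not leave n behind yet
```

A result of zero would mean that every path through the source block already contains the instruction, so no further compensation copies are needed.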

The generation of compensation copies does not have to occur immediately; instead, it can be deferred until a better opportunity arrives in a later wavefront (in Fig. 3.4, along the wavefronts W4-W5). Such an opportunity can be a free execution slot in a block for which no other instruction candidate can be found.

However, the compiler must also keep track of the latest feasible blocks where these copies can be scheduled: in the example, wavefront W5 is the last possibility. There are other factors that can constrain the downward movement of an instruction, such as the availability of its qualifying predicate. In all these cases, the scheduler ensures that the wavefront is only advanced if all instructions that cannot be deferred further have been scheduled (their compensation path vector is then $\vec{0}$). Scheduling multiple copies of an instruction on a path is also avoided.

Figure 3.4: Deferred compensation.

The actual choice of which instruction is scheduled next into a block is made in a manner similar to list scheduling [WM97, Muc97]. For each block on the wavefront, a list of candidate instructions is maintained that contains unscheduled, data ready instructions that originate from a predecessor or a successor of the block. The scheduler selects one of these instructions for scheduling into the block based on a cost-benefit analysis. Instructions are preferred that lie on a global critical DDG path and, in the case of a speculative candidate, are likely to be useful if scheduled speculatively into the block. The usefulness is the likelihood that the speculative execution is not futile, namely:

$$\frac{\mathit{Prob}(\mathit{BPV}(D) \cap \mathit{BPV}(s(n)))}{\mathit{Prob}(\mathit{BPV}(D))}$$

which is the probability that the control flow passes through the instruction’s source block $s(n)$ if it flows through $D$. The cost term of the selection function takes speculation costs (when employing control and data speculation etc.) and resulting compensation copies into account.
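The usefulness term is a conditional probability over path vectors and can be evaluated as in the following sketch. The path probabilities and block path vectors are made-up example values, the function names are hypothetical, and the leftmost bit again stands for the first path.

```python
def prob(x, path_probs):
    """Aggregate probability of the paths contained in path vector x."""
    k = len(path_probs)
    return sum(p for i, p in enumerate(path_probs) if (x >> (k - 1 - i)) & 1)


def usefulness(bpv_d, bpv_src, path_probs):
    """Probability that control reaches the source block, given it reaches D."""
    return prob(bpv_d & bpv_src, path_probs) / prob(bpv_d, path_probs)


path_probs = [0.5, 0.3, 0.2]   # assumed probabilities of three paths
bpv_d = 0b111                  # every path flows through the target block D
bpv_src = 0b110                # only the first two reach the source block
u = usefulness(bpv_d, bpv_src, path_probs)
```

With these values the speculative execution is useful in about 80% of the cases in which D is executed. A non-speculative candidate, whose source block lies on every path through D, has a usefulness of one.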

The support of long-range upward and downward code motion, different sorts of speculation, lazy compensation code insertion, and predication makes wavefront scheduling one of the most complex and powerful global scheduling techniques. An implementation for the Itanium processor achieved a 30% speedup on SPECint 95 over local scheduling (assuming perfect caches) [BMM00].

One drawback is, however, that it fully relies on path vectors, which limit the complexity of the scheduling region. Furthermore, it is a greedy, top-down algorithm: scheduling decisions are made one at a time, based on a heuristic selection function, and never reconsidered. When deciding about scheduling an instruction into a block, it is unknown whether a better opportunity will arrive in a later wavefront—this depends on other scheduling decisions still to be made.

Such interdependences between decisions are well known in the area of code generation; we have already mentioned the phase-coupling problem between code generation phases.

It is evident that data dependences constitute one of the causal links between scheduling decisions that lead to interdependences. Resource aspects also play a major role: techniques like global code motion, speculative loads, and predication increase the demand for execution slots (as described for code motion in Sec. 3.3). At the same time, they decrease the schedule length (this is why they are applied) and thereby the supply of execution slots in the schedule. The compiler must find a trade-off between these two conflicting effects when applying such features. Overly conservative use can lead to empty, wasted execution slots, contrary to the EPIC philosophy. Overuse, however, can force the schedule length to rise again due to a shortage of execution slots, which could spoil the benefit.

These multifaceted interactions are our incentive to employ integer linear programming: it has the potential to resolve all interdependences between scheduling decisions and to deliver a global optimum. Note that the notion of “optimality” is used in an algorithmic sense here: the resulting schedule is optimal with respect to our mathematical definition of the global scheduling problem (to be developed in the next section). In a wider context, however, optimality is relative and more complex: for example, it depends on the input set, which we must approximate by profiling information. There are also many influential dynamic effects that are difficult to predict, such as cache misses interacting with the schedule. Clearly, no mathematical model can fully and precisely describe and minimize these runtime effects. We can achieve strict optimality only within a well-defined problem scope. However, on a statically scheduled architecture, there should be a strong correlation between schedule length and performance, which is also what our experiments in Chapter 7 confirm.

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 73-78)