4.3.3 Scheduling and Code Compaction
The result of the instruction selection phase is the set of assembly instructions that constitute the compiler output. The scheduler decides in which sequence these instructions are issued. Furthermore, many processor architectures are capable of issuing several instructions in a single cycle. To produce efficient code for such instruction-level parallelism (ILP) architectures, the scheduler must be able to find instructions that can be issued in parallel. The technique for scheduling instructions in parallel is called code compaction.
There are two principal impediments to achieving a valid instruction schedule [15]:
Data hazards: Data dependencies force a minimum temporal distance between two instructions. They are directly related to the temporal I/O behavior of instructions. Three types of data hazards can be distinguished [90]:
True dependence: A first instruction writes a value to a resource (e.g. a register) that is to be read by the second instruction. For the read operation to follow the write operation, the scheduler must take into account that the first instruction might need several clock cycles until the value is written. Typical examples of such instructions are multiply/divide operations or loads from memory. True dependencies are also called read after write (RAW) dependencies.
False dependence: The first instruction writes to a resource that is also written by the second instruction. After executing both instructions, the resource shall contain the value of the second instruction. These are also called write after write (WAW) dependencies.
Anti dependence: If the first instruction reads a value from a resource before it is written by a second instruction, this is called an anti dependence or write after read (WAR) dependence. For processors that do not have interlocking hardware, the minimum delays of anti dependencies are usually negative because the read operation of the first instruction happens in an early pipeline stage whereas the write operation of the second instruction happens in a late stage.
Structural hazards: Structural hazards result from instructions that utilize exclusive processor resources. If two instructions require the same resource, these two instructions are mutually exclusive and must be serialized. Typical examples of structural hazards are the issue slots available on a processor architecture: It is never possible to issue more instructions in a cycle than the number of available slots.
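The three data hazard types can be told apart by comparing the resources each instruction reads and writes. The following is a minimal sketch under that assumption; the `Instr` class, the register names and the sample instructions are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction model: a name plus the resources it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def data_hazards(first: Instr, second: Instr) -> set:
    """Return the data hazard types between `first` and a later instruction `second`."""
    hazards = set()
    if first.writes & second.reads:
        hazards.add("RAW")  # true dependence: second reads what first writes
    if first.writes & second.writes:
        hazards.add("WAW")  # false dependence: both write the same resource
    if first.reads & second.writes:
        hazards.add("WAR")  # anti dependence: second overwrites what first reads
    return hazards


mul = Instr("mul r1, r2, r3", reads=frozenset({"r2", "r3"}), writes=frozenset({"r1"}))
add = Instr("add r4, r1, r5", reads=frozenset({"r1", "r5"}), writes=frozenset({"r4"}))
mov = Instr("mov r2, r6", reads=frozenset({"r6"}), writes=frozenset({"r2"}))

print(data_hazards(mul, add))  # {'RAW'}
print(data_hazards(mul, mov))  # {'WAR'}
```

The sketch only detects that a dependence exists; the minimum delay attached to it depends on the pipeline timing of the two instructions and is not modeled here.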
To schedule a basic block, a scheduling graph G is constructed. G = (N, E) is a dependence graph annotated with additional information: The nodes n ∈ N of G are instructions, and allocations(n) are the resources allocated by instruction n ∈ N. The edges e = (n1, n2) ∈ E represent instruction pairs with a data dependency, and weight(e) with e = (n1, n2) ∈ E represents the minimum delay between the instructions n1 and n2.
A schedule S : N → ℕ0 is a mapping function that assigns a clock cycle cycle(n) = S(n) ∈ ℕ0 to each instruction n ∈ N. An instruction schedule S is correct if the following conditions are met:
1. cycle(n1) + weight(e) ≤ cycle(n2) ∀ e = (n1, n2) ∈ E
2. allocations(n1) ∩ allocations(n2) = ∅ ∀ n1, n2 ∈ N with n1 ≠ n2 and cycle(n1) = cycle(n2)
The first condition ensures that no data dependencies are violated. The second condition is used to avoid structural hazards.
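Both conditions can be checked mechanically for a given schedule. The sketch below is an illustration, not part of the original text. It assumes that the structural condition only concerns instructions issued in the same cycle and that a resource is occupied only in the issue cycle. The function name `is_correct`, the dictionary encoding and the sample data are hypothetical.

```python
from itertools import combinations


def is_correct(cycle, edges, allocations):
    """Check the two correctness conditions for a schedule.

    cycle:       dict instruction -> clock cycle (the schedule S)
    edges:       dict (n1, n2) -> weight, the minimum delay between n1 and n2
    allocations: dict instruction -> set of resources it occupies
    """
    # Condition 1: the minimum delay of every dependence edge is respected.
    for (n1, n2), weight in edges.items():
        if cycle[n1] + weight > cycle[n2]:
            return False
    # Condition 2 (assumed reading): instructions issued in the same cycle
    # must not share a resource.
    for n1, n2 in combinations(cycle, 2):
        if cycle[n1] == cycle[n2] and allocations[n1] & allocations[n2]:
            return False
    return True


edges = {("ld", "add"): 2}
allocations = {"ld": {"mem_unit"}, "add": {"alu"}, "sub": {"alu"}}

print(is_correct({"ld": 0, "add": 2, "sub": 1}, edges, allocations))  # True
print(is_correct({"ld": 0, "add": 1, "sub": 2}, edges, allocations))  # False: delay violated
print(is_correct({"ld": 0, "add": 2, "sub": 2}, edges, allocations))  # False: both need the ALU
```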
Equation 4.1 defines the length L(S) of the schedule S. I is the set of all processor instructions.
L(S) = max_{n ∈ N} ( S(n) + max_{m ∈ I} weight(n, m) )    (4.1)
The second addend in this equation represents the conservative calculation of the instruction latency of n. A schedule S_o is optimal if L(S_o) ≤ L(S) for all other correct schedules S. An assembly program with an optimal schedule takes the least possible execution time. Unfortunately, the calculation of S_o is an NP-complete problem. To reduce compilation time, heuristic algorithms are used.
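Equation 4.1 can be evaluated directly for a given schedule. The following sketch is only an illustration: the delay table `DELAY`, the opcode names and the function `schedule_length` are hypothetical, and for brevity each scheduled instruction is identified by its opcode.

```python
def schedule_length(cycle, weight, instruction_set):
    """Evaluate L(S) as in equation 4.1.

    cycle:           dict instruction -> issue cycle S(n)
    weight:          function (n, m) -> minimum delay from n to a following m
    instruction_set: the set I of all processor instructions
    """
    return max(
        # Issue cycle of n plus its worst-case delay towards any instruction m.
        cycle[n] + max(weight(n, m) for m in instruction_set)
        for n in cycle
    )


# Hypothetical delays: DELAY[n][m] is the minimum distance from n to m.
DELAY = {
    "ld":  {"ld": 1, "mul": 2, "add": 2},
    "mul": {"ld": 1, "mul": 3, "add": 3},
    "add": {"ld": 1, "mul": 1, "add": 1},
}

schedule = {"ld": 0, "mul": 2, "add": 5}
length = schedule_length(schedule, lambda n, m: DELAY[n][m], DELAY.keys())
print(length)  # 6, determined by add: 5 + 1
```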
A very common approach is list scheduling. A list scheduler has a worst-case running time of [...]. For RISC architectures, the achieved execution speed of the generated assembly program is very close to that of the optimum schedule [117].
A list scheduler takes a dependence DAG representing a basic block, as depicted in figure 4.6, as input. It selects one or more of the nodes that have no predecessor (the so-called ready set) to be scheduled into a cycle determined by a current_cycle variable. The scheduled nodes are removed from the DAG, current_cycle is potentially incremented, and the loop starts again.
Figure 4.6: Instruction dependency for which listBT gives suboptimal results
In the example of figure 4.6, current_cycle is initialized to 0 and the list scheduler schedules instruction 1, which is the only ready node (i.e. it has no predecessor), into that cycle. The node is removed from the DAG and instruction 2 becomes ready. If we assume that the underlying architecture has only a single issue slot, it is not possible to schedule any other instruction into current_cycle (which is still 0). Due to this resource conflict, current_cycle is incremented. Since no latency constraint is violated, instruction 2 is scheduled into cycle 1. After another scheduling loop, instruction 3 is scheduled into cycle 2. The resulting instruction sequence is 1-2-3.
List scheduling produces good results if the basic block contains many instructions. Unfortunately, loop bodies in particular often contain only a few instructions, which are executed very often. One solution to this problem is to unroll the loop to a certain degree. This has the disadvantage that the code size is increased.
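The list scheduling loop described above can be sketched in a few lines. This is a simplified illustration and not the scheduler discussed in the cited literature: issue slots are the only resource modeled, the priority among ready nodes is simply the given order, and the input must be acyclic. The function name, the edge encoding and the weights in the examples are assumptions, and the graphs used are not those of figure 4.6.

```python
def list_schedule(nodes, edges, issue_slots=1):
    """Cycle-driven list scheduling of one basic block (the graph must be a DAG).

    nodes:       instructions in priority order (here: original program order)
    edges:       dict (n1, n2) -> weight, the minimum delay from n1 to n2
    issue_slots: number of instructions that may be issued per cycle
    """
    preds = {n: [] for n in nodes}
    for (n1, n2), weight in edges.items():
        preds[n2].append((n1, weight))

    cycle = {}  # the schedule S: instruction -> cycle
    current_cycle = 0
    while len(cycle) < len(nodes):
        issued = 0
        for n in nodes:
            if n in cycle or issued == issue_slots:
                continue  # already scheduled, or no free issue slot left
            # Ready: all predecessors are scheduled and their delays have elapsed.
            if all(p in cycle and cycle[p] + w <= current_cycle for p, w in preds[n]):
                cycle[n] = current_cycle
                issued += 1
        current_cycle += 1
    return cycle


# A chain 1 -> 2 -> 3 with hypothetical delays of one cycle, single issue slot.
print(list_schedule([1, 2, 3], {(1, 2): 1, (2, 3): 1}))
# {1: 0, 2: 1, 3: 2}

# Two issue slots: the independent instructions a and b share cycle 0.
print(list_schedule(["a", "b", "c"], {("a", "c"): 2}, issue_slots=2))
# {'a': 0, 'b': 0, 'c': 2}
```

Real list schedulers usually rank the ready set with a priority function such as the critical path length instead of program order; that choice is left out here to keep the loop structure visible.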
Better results can be achieved if the scheduler is able to schedule across basic block boundaries. Trace Scheduling [51] is a technique that analyzes the control flow between basic blocks. Based on estimations or profiling feedback, it identifies the most frequently used control paths. The sequence of instructions that are part of such a path is called a trace. A standard basic block scheduling technique is used to schedule the instructions in the trace. Compensation code is added at each entry to and exit from the trace. This usually means that code fragments are copied across control flow boundaries to compensate for out-of-order execution effects. Loop bodies are usually unrolled several times before being scheduled.
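The selection of a trace from profiling data can be illustrated as follows. This is a deliberately simplified, forward-only sketch and not the algorithm of [51]: it starts at a given entry block and always follows the most frequently taken outgoing edge. The block names, the edge counts and the function `select_trace` are hypothetical, and the insertion of compensation code is not shown.

```python
def select_trace(entry, edge_counts):
    """Greedily follow the most frequently taken edges starting at `entry`.

    edge_counts: dict (source_block, target_block) -> execution count
    """
    trace = [entry]
    block = entry
    while True:
        # Candidate successors that are not on the trace yet; excluding
        # visited blocks also stops the walk at loop back edges.
        successors = {
            dst: count
            for (src, dst), count in edge_counts.items()
            if src == block and dst not in trace
        }
        if not successors:
            return trace
        block = max(successors, key=successors.get)
        trace.append(block)


# Hypothetical if/else diamond in which the B1 branch is taken 90% of the time.
profile = {
    ("B0", "B1"): 90,
    ("B0", "B2"): 10,
    ("B1", "B3"): 90,
    ("B2", "B3"): 10,
}
print(select_trace("B0", profile))  # ['B0', 'B1', 'B3']
```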
Another scheduling technique that is very effective for scheduling loops on ILP architectures is software pipelining [100]. The concept of this technique is to execute different instructions from several loop iterations in parallel. If, for example, a loop body consists of three instructions a, b, and c that need to be executed in sequence, it might be possible to execute the instructions as depicted in figure 4.7. The indices indicate the loop iteration. Note that in this example there must be three or more loop iterations to create a valid software pipelined loop (i ≥ 3).
Figure 4.7: A software pipelined loop
The prologue and the epilogue sections contain instructions of linear control flow that are required to set up and to terminate the loop kernel. The kernel is executed multiple times and can be spread over several clock cycles.
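The division into prologue, kernel and epilogue can be made concrete with a small generator. The sketch below is an illustration under simplifying assumptions that are not taken from figure 4.7: every instruction takes one cycle, a new iteration starts every cycle, and there are enough issue slots for one instruction of each stage. The function name `software_pipeline` is hypothetical.

```python
def software_pipeline(stages, iterations):
    """Return (prologue, kernel, epilogue) as lists of per-cycle instruction groups."""
    depth = len(stages)
    if iterations < depth:
        raise ValueError("need at least as many iterations as pipeline stages")
    rows = []
    for cycle in range(iterations + depth - 1):
        # Iteration i (0-based) executes stage s in cycle i + s; the printed
        # index is the 1-based iteration number.
        row = [
            f"{stages[cycle - i]}{i + 1}"
            for i in range(iterations)
            if 0 <= cycle - i < depth
        ]
        rows.append(row)
    prologue = rows[:depth - 1]         # pipeline is filling up
    kernel = rows[depth - 1:iterations]  # all stages are busy
    epilogue = rows[iterations:]         # pipeline is draining
    return prologue, kernel, epilogue


prologue, kernel, epilogue = software_pipeline(["a", "b", "c"], 4)
print(prologue)  # [['a1'], ['b1', 'a2']]
print(kernel)    # [['c1', 'b2', 'a3'], ['c2', 'b3', 'a4']]
print(epilogue)  # [['c3', 'b4'], ['c4']]
```

With four iterations the kernel row pattern (c, b, a of three consecutive iterations) repeats twice; with exactly three iterations it occurs once, which matches the minimum loop count mentioned above.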
It was demonstrated in [42] that software pipelining can speed up the generated assembly program several-fold. Unfortunately, the implementation effort of software pipelining is quite high because many potential constraints need to be analyzed before a loop can be software pipelined (e.g. loop-carried dependencies, memory aliasing, the loop iteration variable, and the minimum loop count).