Scheduling and Code Compaction - Applicability for Compiler Generation


4.3.3 Scheduling and Code Compaction

The result of the instruction selection phase is the set of assembly instructions that constitute the compiler output. The scheduler decides in which order these instructions are issued. Furthermore, many processor architectures can issue several instructions in a single cycle. To produce efficient code for such instruction-level parallelism (ILP) architectures, the scheduler must be able to find instructions that can be issued in parallel. The technique for scheduling instructions in parallel is called code compaction.

There are two principal impediments to achieving a valid instruction schedule [15]:

Data hazards: Data dependencies force a minimum temporal distance between two instructions. They are directly related to the temporal I/O behavior of the instructions. Three types of data hazards can be distinguished [90]:

True dependence: A first instruction writes a value to a resource (e.g. a register) that is to be read by the second instruction. For the read operation to follow the write operation, the scheduler must take into account that the first instruction might need several clock cycles until the value is written. Typical examples of such instructions are multiply/divide operations or loads from memory. True dependencies are also called read after write (RAW) dependencies.

False dependence: The first instruction writes to a resource that is also written by the second instruction. After both instructions have executed, the resource shall contain the value written by the second instruction. These are also called write after write (WAW) dependencies.

Anti dependence: If the first instruction reads a value from a resource before that resource is written by a second instruction, this is called an anti dependence or write after read (WAR) dependence. For processors that do not have interlocking hardware, the minimum distance required by an anti dependence is usually negative, because the read operation of the first instruction happens in an early pipeline stage whereas the write operation of the second instruction happens in a late stage.

Structural hazards: Structural hazards result from instructions that utilize exclusive processor resources. If two instructions require the same resource, they are mutually exclusive and must be serialized. A typical example of a structural hazard involves the issue slots available on a processor architecture: it is never possible to issue more instructions in a cycle than the number of available slots.
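The three kinds of data hazard can be told apart by comparing the read and write sets of two instructions in program order. The following is a minimal sketch of that classification; the `Instr` class, the register names, and the example instructions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction model: a name plus the resources it reads and writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def data_hazards(first, second):
    """Classify the data hazards between `first` and a later instruction `second`."""
    kinds = set()
    if first.writes & second.reads:
        kinds.add("RAW")  # true dependence: second reads what first writes
    if first.writes & second.writes:
        kinds.add("WAW")  # false dependence: both write the same resource
    if first.reads & second.writes:
        kinds.add("WAR")  # anti dependence: second overwrites what first reads
    return kinds


# Illustrative instructions (assumed syntax: destination register first).
mul = Instr("mul r1, r2, r3", reads=frozenset({"r2", "r3"}), writes=frozenset({"r1"}))
add = Instr("add r4, r1, r5", reads=frozenset({"r1", "r5"}), writes=frozenset({"r4"}))
mov = Instr("mov r2, r6", reads=frozenset({"r6"}), writes=frozenset({"r2"}))
clr = Instr("clr r1", writes=frozenset({"r1"}))

print(data_hazards(mul, add))
print(data_hazards(mul, mov))
print(data_hazards(mul, clr))
```

A pair of instructions may carry more than one kind of hazard at once, which is why the function returns a set rather than a single label.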

To schedule a basic block, a scheduling graph G is constructed. G = (N, E) is a dependence graph annotated with additional information: The nodes n ∈ N of G are instructions, and allocations(n) is the set of resources allocated by instruction n ∈ N. The edges e = (n₁, n₂) ∈ E represent instruction pairs with a data dependency, and weight(e) is the minimum delay between the instructions n₁ and n₂.
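One possible in-memory representation of such a scheduling graph is sketched below. The class name `SchedulingGraph`, its methods, and the example resources and delays are assumptions made for illustration and are not taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulingGraph:
    """Sketch of G = (N, E) with allocations(n) and weight(e) stored as dictionaries."""
    allocations: dict = field(default_factory=dict)  # node -> frozenset of resources
    weight: dict = field(default_factory=dict)       # (n1, n2) -> minimum delay

    def add_node(self, n, resources):
        self.allocations[n] = frozenset(resources)

    def add_edge(self, n1, n2, delay):
        # If several dependencies connect the same pair, keep the strictest delay.
        key = (n1, n2)
        self.weight[key] = max(delay, self.weight.get(key, delay))

    def predecessors(self, n):
        return {a for (a, b) in self.weight if b == n}


# Hypothetical basic block: a load feeding a multiply feeding an add.
g = SchedulingGraph()
g.add_node("ld", {"slot0", "mem"})
g.add_node("mul", {"slot0", "mult"})
g.add_node("add", {"slot0", "alu"})
g.add_edge("ld", "mul", 2)   # assumed load latency of 2 cycles
g.add_edge("mul", "add", 3)  # assumed multiply latency of 3 cycles

print(g.predecessors("add"))
```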

A schedule S : n → c is a mapping function that assigns a clock cycle c = S(n) ∈ ℕ₀, also written cycle(n), to each instruction n ∈ N. An instruction schedule S is correct if the following conditions are met:

1. cycle(n₁) + weight(e) ≤ cycle(n₂) ∀ e = (n₁, n₂) ∈ E

2. allocations(n₁) ∩ allocations(n₂) = ∅ ∀ n₁, n₂ ∈ N with n₁ ≠ n₂ and cycle(n₁) = cycle(n₂)

The ﬁrst condition ensures that no data dependencies are violated. The second condition is used to avoid structural hazards.
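The two conditions translate directly into a checker. The sketch below interprets the second condition as applying to distinct instructions that are scheduled into the same cycle, which is an assumption about the intended reading; the function name and the example data are hypothetical.

```python
from itertools import combinations


def is_correct(cycle, weight, allocations):
    """Check a schedule against the data dependency and structural hazard conditions.

    cycle:       dict mapping instruction -> assigned clock cycle
    weight:      dict mapping (n1, n2) -> minimum delay between n1 and n2
    allocations: dict mapping instruction -> set of resources it occupies
    """
    # Condition 1: every dependency edge keeps its minimum delay.
    for (n1, n2), w in weight.items():
        if cycle[n1] + w > cycle[n2]:
            return False
    # Condition 2 (assumed reading): instructions in the same cycle share no resource.
    for n1, n2 in combinations(cycle, 2):
        if cycle[n1] == cycle[n2] and allocations[n1] & allocations[n2]:
            return False
    return True


# Hypothetical example: the add needs the loaded value two cycles after the load.
weight = {("ld", "add"): 2}
allocations = {"ld": {"mem"}, "add": {"alu"}, "sub": {"alu"}}

print(is_correct({"ld": 0, "sub": 1, "add": 2}, weight, allocations))
print(is_correct({"ld": 0, "add": 1, "sub": 2}, weight, allocations))
print(is_correct({"ld": 0, "add": 2, "sub": 2}, weight, allocations))
```

The first schedule satisfies both conditions, the second issues the add too early, and the third places two users of the same unit into one cycle.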

Equation 4.1 defines the length L(S) of the schedule S, where I is the set of all processor instructions.

$$L(S) = \max_{n \in N} \Bigl( S(n) + \max_{m \in I} \mathrm{weight}(n, m) \Bigr) \qquad (4.1)$$

The second addend in this equation represents the conservative calculation of the instruction latency of n. A schedule S_o is optimal if L(S_o) ≤ L(S) for all other correct schedules S. An assembly program with an optimal schedule takes the least possible execution time. Unfortunately, the calculation of S_o is an NP-complete problem. To reduce compilation time, heuristic algorithms are used.
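As a small worked instance of Equation 4.1, the schedule length can be computed by taking, for each scheduled instruction, its issue cycle plus its worst-case latency toward any instruction of the processor. The sketch below assumes that node names double as instruction kinds and uses a made-up latency table; neither comes from the text.

```python
def schedule_length(S, weight, instruction_set):
    """L(S): latest issue cycle plus conservative latency, maximized over all nodes."""
    return max(
        S[n] + max(weight(n, m) for m in instruction_set)
        for n in S
    )


# Hypothetical latencies: cycles that must pass before an instruction of kind m
# may follow an instruction of kind n.
LATENCY = {("ld", "add"): 2, ("ld", "mul"): 3}


def weight(n, m):
    # Assumption: one cycle wherever the table lists nothing else.
    return LATENCY.get((n, m), 1)


I = ["ld", "add", "mul"]     # assumed processor instruction set
S = {"ld": 0, "add": 2}      # schedule: ld in cycle 0, add in cycle 2

print(schedule_length(S, weight, I))  # 3
```

Here the load contributes 0 + 3 and the add contributes 2 + 1, so both terms yield a length of 3.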

A very common approach is list scheduling. A list scheduler has a worst-case running time of [...] RISC architectures the achieved execution speed of the generated assembly program is very close to the optimum schedule [117].

A list scheduler takes a dependence DAG representing a basic block, as depicted in figure 4.6, as input. It selects one or more of the nodes that have no predecessor (the so-called ready set) to be scheduled into a cycle determined by a current_cycle variable. The scheduled nodes are removed from the DAG, current_cycle is potentially incremented, and the loop starts again.

Figure 4.6: Instruction dependency for which listBT gives suboptimal results

In the example of figure 4.6, current_cycle is initialized to 0 and the list scheduler schedules instruction 1, which is the only ready node (i.e. it has no predecessor), into that cycle. The node is removed from the DAG and instruction 2 becomes ready. If we assume that the underlying architecture has only a single issue slot, it is not possible to schedule any other instruction into current_cycle (which is still 0). Due to this resource conflict, current_cycle is incremented. Since no latency constraint is violated, instruction 2 is scheduled into cycle 1. After another iteration of the scheduling loop, instruction 3 is scheduled into cycle 2. The resulting instruction sequence is 1-2-3.

List scheduling produces good results if the basic block contains many instructions. Unfortunately, loop bodies in particular often contain only a few instructions that are executed very often. One solution to this problem is to unroll the loop to a certain degree. This has the disadvantage that the code size is increased.
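The list scheduling loop described above can be sketched in a few lines. This is a simplified illustration, not the listBT algorithm named in the figure caption: it uses program order as the only priority, models issue slots as the only resource, and assumes the dependence graph is acyclic. The chain 1 → 2 → 3 with unit delays is an assumed stand-in for figure 4.6, whose exact shape is not reproduced here.

```python
def list_schedule(nodes, weight, issue_slots=1):
    """Greedy list scheduler.

    nodes:       instructions in program order (used as priority order)
    weight:      dict mapping (n1, n2) -> minimum delay; must describe a DAG
    issue_slots: number of instructions that may issue per cycle
    """
    preds = {n: set() for n in nodes}
    for (a, b) in weight:
        preds[b].add(a)

    cycle = {}
    remaining = list(nodes)
    current_cycle = 0
    while remaining:
        issued = 0
        for n in list(remaining):
            if issued == issue_slots:
                break  # resource conflict: no free issue slot in this cycle
            # A node is ready once every predecessor is scheduled and its delay has elapsed.
            ready = all(
                p in cycle and cycle[p] + weight[(p, n)] <= current_cycle
                for p in preds[n]
            )
            if ready:
                cycle[n] = current_cycle
                remaining.remove(n)
                issued += 1
        current_cycle += 1
    return cycle


# Assumed dependence chain 1 -> 2 -> 3 with a delay of one cycle per edge.
chain = {(1, 2): 1, (2, 3): 1}
print(list_schedule([1, 2, 3], chain, issue_slots=1))  # {1: 0, 2: 1, 3: 2}
```

With a single issue slot the sketch reproduces the sequence 1-2-3 from the walkthrough. Production schedulers usually replace the program-order priority with a heuristic such as the longest remaining path to the end of the block.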

Better results can be achieved if the scheduler is able to schedule across basic block boundaries. Trace scheduling [51] is a technique that analyzes the control flow between basic blocks. Based on estimations or profiling feedback, it identifies the most frequently executed control paths. The sequence of instructions along such a path is called a trace. A standard basic block scheduling technique is used to schedule the instructions in the trace. Compensation code is added at each entry to and exit from the trace. This usually means that code fragments are copied across control flow boundaries to compensate for instructions that were moved out of their original order. Loop bodies are usually unrolled several times before being scheduled.
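The trace selection step can be illustrated with a simplified greedy sketch: start at the most frequently executed block that is not yet part of a trace and repeatedly follow the most frequently taken outgoing edge. This forward-only variant, the function name, and the profile counts are assumptions for illustration; it omits backward growth of the trace and the insertion of compensation code.

```python
def select_trace(block_count, edge_count, visited):
    """Pick one trace from a profiled control flow graph.

    block_count: dict mapping basic block -> execution count
    edge_count:  dict mapping (src, dst) -> number of times the edge was taken
    visited:     set of blocks already assigned to a trace (updated in place)
    """
    candidates = [b for b in block_count if b not in visited]
    if not candidates:
        return []
    current = max(candidates, key=block_count.get)  # hottest unassigned block
    trace = [current]
    visited.add(current)
    while True:
        succs = {
            dst: count
            for (src, dst), count in edge_count.items()
            if src == current and dst not in visited
        }
        if not succs:
            break
        current = max(succs, key=succs.get)  # follow the most frequent edge
        trace.append(current)
        visited.add(current)
    return trace


# Hypothetical profile of an if/else diamond: the A-B-D path dominates.
block_count = {"A": 100, "B": 90, "C": 10, "D": 100}
edge_count = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}

visited = set()
first = select_trace(block_count, edge_count, visited)
second = select_trace(block_count, edge_count, visited)
print(first, second)
```

Each selected trace would then be handed to a basic block scheduler as if it were one straight-line block.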

Another scheduling technique that is very effective for scheduling loops on ILP architectures is software pipelining [100]. The idea is to execute instructions from several loop iterations in parallel. If, for example, a loop body consists of three instructions a, b, and c that must be executed in sequence, it might be possible to execute the instructions as depicted in figure 4.7. The indices indicate the loop iteration. Note that in this example the loop must run for at least three iterations to form a valid software pipelined loop (i ≥ 3).

Figure 4.7: A software pipelined loop

The prologue and the epilogue sections contain instructions of linear control ﬂow that are required to set up and to terminate the loop kernel. The kernel is executed multiple times and can be spread over several clock cycles.
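The overlap of iterations can be made concrete by generating the cycle-by-cycle layout of a software pipelined loop. The sketch below assumes that every instruction takes one cycle, that a new iteration starts every cycle, and that the machine can issue all instructions of one row in parallel; these assumptions and the function name are illustrative and may differ from the layout shown in figure 4.7.

```python
def software_pipeline(stages, iterations):
    """Return one (section, operations) row per clock cycle.

    stages:     instruction names of the loop body in execution order
    iterations: loop trip count; must be at least len(stages)
    """
    k = len(stages)
    if iterations < k:
        raise ValueError("need at least as many iterations as loop body instructions")
    rows = []
    for t in range(iterations + k - 1):
        # In cycle t, stage s works on iteration t - s (printed 1-based).
        ops = [
            f"{stages[s]}{t - s + 1}"
            for s in range(k)
            if 0 <= t - s < iterations
        ]
        if t < k - 1:
            section = "prologue"   # pipeline is still filling
        elif t < iterations:
            section = "kernel"     # all stages are busy
        else:
            section = "epilogue"   # pipeline is draining
        rows.append((section, ops))
    return rows


for section, ops in software_pipeline(["a", "b", "c"], 4):
    print(f"{section:9s} {' '.join(ops)}")
```

For four iterations this yields two prologue cycles (a1, then a2 b1), two kernel cycles in which a, b, and c of three different iterations run together, and two epilogue cycles (b4 c3, then c4).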

It was demonstrated in [42] that software pipelining can improve the execution speed of the generated assembly program severalfold. Unfortunately, the implementation effort of software pipelining is quite high, because many potential constraints need to be analyzed before a loop can be software pipelined (e.g. loop-carried dependencies, memory aliasing, the loop iteration variable, and the minimum loop count).

In document C compiler aided design of application specific instruction set processors using the machine description language LISA (Page 43-46)