The basic approach described in Sections 4.3 and 4.4 is limited to a many-to-one mapping of operations to function units. Under this mapping, a function unit may accept multiple classes of operations, but each operation class can be bound to only a single function unit type. For example, an ALU may accept both multiply and add operations, but if so, multiply and add operations can only be scheduled on an ALU and on no other function unit type. As a result of this limitation, no choice exists when binding an operation to a resource.
Therefore, a more general strategy for binding is needed. In this section, I describe an extension to the proposed scheduling approach that allows a many-to-many mapping of operations to function units. In practice, this extension allows for a wider variety of legal schedules and allocations, and therefore may produce better solutions.
4.5.1 Modified annotations
In Section 4.3, each node is immediately bound to a specific resource class containing function unit properties such as area and latency. This binding allows for one-time analysis of some computed DFG annotations but prevents an operation from executing on alternative function units, e.g., an addition executing on either an ALU or a dedicated adder.
Instead, we would prefer to bind each DFG node simply to an operation class, such as addition or multiplication. The actual binding of node to resource (i.e., function unit) is performed during the scheduling process. As a result, certain DFG annotations must be parametrized by allocation and recomputed multiple times during execution of the solver.
The main computed annotation used in the original approach is the STTF, or shortest time to finish. This property indicates the earliest that a node and its children can complete execution in the presence of infinite resources.
The computation of STTF in Section 4.3 is as follows:
$$\mathrm{STTF}(N_i) = \max_{j \mid N_i \rightarrow N_j} \bigl(\mathrm{STTF}(N_j)\bigr) + \mathrm{EX}(N_i) \tag{4.3}$$
where $N_i \rightarrow N_j$ indicates that $N_j$ is dependent on $N_i$ and $\mathrm{EX}(N)$ represents the execution time of the function unit this node is bound to. However, as the modified approach does not initially bind a node to a function unit, the equation must be changed:
$$\mathrm{STTF}(N_i) = \max_{j \mid N_i \rightarrow N_j} \bigl(\mathrm{STTF}(N_j)\bigr) + \min_{k \mid R_k \Leftarrow N_i} \bigl(\mathrm{EX}(R_k, N_i)\bigr) \tag{4.4}$$
where $\mathrm{EX}(R, N)$ represents the execution time of a function unit $R$ for $N$'s operation, and $R_k \Leftarrow N_i$ indicates node $N_i$ can operate on function unit $R_k$. This formulation is subject to the restriction that a resource $R_k$ can only be considered if its allocation count is greater than zero, hence the requirement that the DFG must be re-annotated with each new allocation.
Figure 4.11: Full expansion of the DAG for a two-operation DFG with one ALU and one multiplier
This equation highlights several benefits of the modified approach: (i) an operation can be scheduled on multiple different types of function units, (ii) a function unit may accept multiple different types of operations, and (iii) a function unit may have parametrized latencies for each operation type it accepts.
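As an illustration of the modified STTF computation, the following is a minimal Python sketch of Equation 4.4, not the solver's actual implementation. The function name `compute_sttf`, the dictionary-based DFG encoding, and the multiply latencies are assumptions made for this example; only the add latencies (6 on a dedicated adder, 10 on an ALU) are taken from Table 4.2.

```python
from functools import lru_cache


def compute_sttf(succs, op_class, exec_time, allocation):
    """Return {node: STTF} for a DFG under one allocation (Equation 4.4).

    succs:      {node: [dependent nodes]}, i.e. the N_i -> N_j edges
    op_class:   {node: operation class}, e.g. "add" or "mul"
    exec_time:  {(unit_type, operation class): latency}; a missing key means
                the unit type does not accept that operation class
    allocation: {unit_type: count}
    """

    @lru_cache(maxsize=None)
    def sttf(node):
        # Only unit types with a nonzero allocation count may be considered.
        latencies = [
            lat
            for (unit, op), lat in exec_time.items()
            if op == op_class[node] and allocation.get(unit, 0) > 0
        ]
        if not latencies:
            raise ValueError(f"no allocated unit can execute {node!r}")
        # Longest remaining path through the dependents, 0 for a leaf node.
        tail = max((sttf(child) for child in succs.get(node, [])), default=0)
        return tail + min(latencies)

    return {node: sttf(node) for node in op_class}


# Hypothetical two-node DFG: an addition a feeding a multiplication b.
succs = {"a": ["b"], "b": []}
op_class = {"a": "add", "b": "mul"}
exec_time = {
    ("adder", "add"): 6,   # from Table 4.2
    ("alu", "add"): 10,    # from Table 4.2
    ("alu", "mul"): 20,    # assumed for illustration
    ("mult", "mul"): 15,   # assumed for illustration
}

full = compute_sttf(succs, op_class, exec_time, {"adder": 1, "alu": 1, "mult": 1})
alu_only = compute_sttf(succs, op_class, exec_time, {"alu": 1})
```

Because the minimum is taken only over allocated unit types, the annotation has to be recomputed for each allocation: the same DFG yields a larger STTF when only the ALU is available.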
4.5.2 Expanding the search space
The modified approach adds complexity to the search space by allowing multiple node+ nodes corresponding to the same operation to exist as children of a single node. For example, a(A)+, which represents node a operating on an ALU, and a(∗)+, which represents node a operating on a multiplier, can both be children of the same node, as shown in Figure 4.11. The selection of possible children nodes is allocation-dependent, as some allocations may lack certain function unit types.
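The allocation-dependent selection of children can be sketched as follows. This is a simplified illustration and not the solver's expansion routine: the function name `start_children`, the `accepts` table, and the assumption that both operations are multiplications are hypothetical, and the sketch ignores whether a unit is currently busy.

```python
def start_children(ready_nodes, op_class, accepts, allocation):
    """List one node+ child per (ready node, usable function unit type).

    accepts:    {unit_type: set of operation classes the unit can execute}
    allocation: {unit_type: count}; unit types with count 0 yield no children
    """
    children = []
    for node in ready_nodes:
        for unit, ops in accepts.items():
            if op_class[node] in ops and allocation.get(unit, 0) > 0:
                children.append(f"{node}({unit})+")
    return children


# Assumed setup in the spirit of Figure 4.11: two operations that both an
# ALU ("A") and a multiplier ("*") can execute.
op_class = {"a": "mul", "b": "mul"}
accepts = {"A": {"add", "mul"}, "*": {"mul"}}

both_units = start_children(["a", "b"], op_class, accepts, {"A": 1, "*": 1})
alu_only = start_children(["a", "b"], op_class, accepts, {"A": 1, "*": 0})
```

With both unit types allocated, each operation contributes two children; removing the multiplier from the allocation halves the branching at this level.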
4.5.3 Modified time bound
One of the keys to the branch-and-bound approach is the selection of safe but effective bounds to reduce the overall search space. In Section 4.4, the resource-constrained shortest-time-to-finish bound (RCSTTF) is used to effectively prune the search space. Conceptually, this bound solves a simpler, less-constrained scheduling problem: find the best possible schedule in the absence of dependencies.
The bound is calculated by generating a list of nodes that have yet to start for a specific operation type and sorting them by their STTF value. At this point, the maximum STTF node can be summed with the latency of the current node, giving a loose bound on the earliest this path could finish. An additional RC term is added that takes into consideration the time that a node can be expected to wait to gain access to the resource it is bound to under the current allocation. The addition of this term forms an even tighter minimum bound on latency.
Because the original bound relied on having a many-to-one mapping of operations to resources, we must now modify the RCSTTF bound to incorporate the possibility of a many-to-many mapping. Since a node could be executed on a function unit that accepts multiple operation types, the best option is a conservative approach. In order to ensure that the bound is safe, when analyzing wait terms for a specific operation class (e.g., addition) the wait terms associated with multi-operation function units (e.g., ALUs) operate under the assumption that this operation class has exclusive access to the function unit. A sample calculation for this bound for a single class of operation is shown in Table 4.2, in which an add operation takes 6 time units on a dedicated adder and 10 time units on an ALU.
Table 4.2: Modified RCSTTF bound for one adder (6 unit latency) and one ALU (10 unit latency)
Node   Current Time   STTF   Wait Term   Estimated Finish
q+     50             22     0           72
r+     50             16     0           66
s+     50             16     6           72
t+     50             14     10          74
u+     50             10     12          72
v+     50             8      18          74
4.5.3.1 Additional optimizations and considerations
When performing time-minimization in Section 4.4, each allocation generated is run through the resource-constrained time-minimization method and the best solution is maintained at each step to aid in rapidly pruning the search space. Because the number of legal allocations may increase significantly when multiple resource types are available, we can now extend this algorithm by performing two additional optimizations: (i) a heuristic pass is first performed for each allocation to provide a good initial time bound, and (ii) the allocation that provided the best heuristic solution is explored first.
The rationale behind these optimizations is that several allocations have a minimum latency which is much greater than the true minimum provided by another allocation. Minimizing latency with respect to these allocations is generally a fruitless and time-consuming endeavor. Therefore, a heuristic pass can help order these allocations such that the best allocations are considered first.
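The ordering step can be sketched in Python as below. The heuristic itself is not specified in this section, so `heuristic_latency` is a placeholder callable, and the allocation labels and latency values are invented purely for illustration.

```python
def order_allocations(allocations, heuristic_latency):
    """Order allocations by heuristic latency and return the best latency.

    heuristic_latency: placeholder for whatever fast heuristic scheduler is
    used; it maps an allocation to the latency of a legal schedule.
    Assumes at least one allocation is given.
    """
    scored = [
        (heuristic_latency(alloc), idx, alloc)
        for idx, alloc in enumerate(allocations)
    ]
    # Sort by latency, breaking ties by original position.
    scored.sort(key=lambda item: item[:2])
    initial_bound = scored[0][0]
    return initial_bound, [alloc for _, _, alloc in scored]


# Invented heuristic results for three hypothetical allocations.
fake_results = {"1 adder": 40, "1 ALU": 52, "1 adder + 1 ALU": 34}
initial_bound, ordered = order_allocations(list(fake_results), fake_results.get)
```

Here the exact search would start from the allocation with the best heuristic result, using that result as the initial time bound, so that allocations whose lower bound already exceeds it can be pruned quickly.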
One final optimization is to sort children of a node by considering the latency of the bound function unit type in addition to STTF. In this way, the algorithm will prioritize execution on lower-latency function units over higher-latency function units, if both are available.
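A possible form of this ordering is sketched below in Python. The exact sort key is an assumption: the sketch places nodes with larger STTF first and breaks ties by preferring the lower-latency function unit. The function name `order_children` and the sample values are hypothetical.

```python
def order_children(children, sttf, latency):
    """Sort (node, unit_type) candidates for expansion.

    Assumed key: larger STTF first, then lower latency of the bound unit.
    latency: {(node, unit_type): execution time of the node on that unit}
    """
    return sorted(children, key=lambda c: (-sttf[c[0]], latency[c]))


children = [("a", "alu"), ("a", "adder"), ("b", "alu")]
sttf = {"a": 16, "b": 22}
latency = {("a", "alu"): 10, ("a", "adder"): 6, ("b", "alu"): 10}

ordered = order_children(children, sttf, latency)
# ordered == [("b", "alu"), ("a", "adder"), ("a", "alu")]
```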