B. Algorithm of Simultaneous Gate Sizing and ? ? Assignment 93
3. Gate-level Task Scheduling
We describe gate-level task scheduling in the context of a topological order traversal of the circuit; the techniques for a reverse topological order traversal are almost the same. In a topological order traversal, a gate is a processed gate if the delay/power computation for all of its implementations is complete. A gate is a current gate if all of its fanin gates are processed gates. A gate is a prospective gate if all of its fanin gates are either current gates or processed gates. In GPU-based parallel computing, multiple current gates can be processed at the same time because there is no interdependency among their computations; for this reason, a set of current gates is called a set of independent current gates (ICGs). Because GPU computing bandwidth is restricted, the number of current gates that can be processed at the same time is limited. A critical problem is therefore how to select a subset of the independent current gates for parallel processing. This subset is called the concurrent gate group, and it has a maximum allowed size.
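The three gate categories defined above can be illustrated with a minimal Python sketch. The netlist representation (a dictionary mapping each gate name to a list of its fanin gates) and the function name `classify_gates` are assumptions made for illustration only. The sketch also assumes that a gate is reported as prospective only if it is not already current, and that a gate with no fanins counts as current.

```python
def classify_gates(fanins, processed):
    """Split the unprocessed gates into current and prospective gates.

    fanins    -- dict: gate name -> list of fanin gate names (hypothetical format)
    processed -- iterable of gates whose delay/power computation is complete
    """
    processed = set(processed)
    # Current gate: every fanin gate is already processed.
    current = {g for g, fi in fanins.items()
               if g not in processed and all(f in processed for f in fi)}
    # Prospective gate: every fanin gate is current or processed
    # (assumption: gates that are already current are excluded here).
    prospective = {g for g, fi in fanins.items()
                   if g not in processed and g not in current
                   and all(f in processed or f in current for f in fi)}
    return current, prospective


# Tiny made-up netlist: c depends on a and b, d depends on c.
netlist = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
current, prospective = classify_gates(netlist, processed={"a"})
print(current, prospective)
```

Gate `d` falls into neither set in this small netlist, because its fanin `c` is neither current nor processed yet.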
The way a concurrent gate group is formed may greatly affect how efficiently the GPU-based parallelism is utilized. This is illustrated by the example in Fig. 20, where the processed gates, independent current gates and prospective gates are represented by dashed, grey and white rectangles, respectively. If the maximum group size is 4, there are at least two different ways of forming a concurrent gate group for the scenario of Fig. 20(a). In Fig. 20(b), {𝐺1, 𝐺2, 𝐺3, 𝐺4} are selected as the concurrent gate group. After they are processed, any four gates among {𝐺5, 𝐺6, 𝐺7, 𝐺8, 𝐺9} may become the next concurrent gate group. Alternatively, we can choose {𝐺2, 𝐺3, 𝐺4, 𝐺5} as in Fig. 20(c). However, after {𝐺2, 𝐺3, 𝐺4, 𝐺5} are processed, the next concurrent gate group can include at most three gates, {𝐺1, 𝐺9, 𝐺10}, since a fanin gate of {𝐺6, 𝐺7, 𝐺8} has not been processed yet. The selection in Fig. 20(c) is inferior to that in Fig. 20(b) because it cannot fully utilize the concurrent group size of 4.
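The comparison between the two selections can be reproduced with a short Python sketch. Fig. 20 itself is not available here, so the fanin lists in `FIG20_FANINS` are a hypothetical netlist chosen only to be consistent with the statements in the text; the real figure may differ. The already processed gates are omitted, so 𝐺1 to 𝐺5 appear with empty fanin lists.

```python
# Hypothetical fanins consistent with the description of Fig. 20 (assumption).
FIG20_FANINS = {
    "G1": [], "G2": [], "G3": [], "G4": [], "G5": [],
    "G6": ["G1", "G2"],
    "G7": ["G1", "G3"],
    "G8": ["G1", "G4"],
    "G9": ["G3"],
    "G10": ["G4", "G5"],
}


def succeeding_icgs(fanins, done):
    """Gates that can be processed next once every gate in `done` is processed."""
    done = set(done)
    return {g for g, fi in fanins.items()
            if g not in done and all(f in done for f in fi)}


choice_b = succeeding_icgs(FIG20_FANINS, {"G1", "G2", "G3", "G4"})
choice_c = succeeding_icgs(FIG20_FANINS, {"G2", "G3", "G4", "G5"})
print(len(choice_b), sorted(choice_b))  # candidates after the Fig. 20(b) choice
print(len(choice_c), sorted(choice_c))  # candidates after the Fig. 20(c) choice
```

Under these assumed fanins, the first choice leaves five candidates for a group of size 4, while the second leaves only three.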
The problem of finding a concurrent gate group among a set of independent current gates can be formulated as a max-throughput problem, which maximizes the minimum size over all concurrent gate groups. The max-throughput problem is very difficult to solve. Therefore, we focus on a reduced problem, max-succeeding-group: given a set of independent current gates, choose a subset of them as the concurrent gate group such that the size of the succeeding independent gate group is maximized. We show in the appendix that the max-succeeding-group problem is NP-complete.
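One possible way to write the max-succeeding-group objective is sketched below. The symbols $P$ (processed gates), $C$ (independent current gates), $B$ (maximum group size), $S$ (chosen group) and $R(S)$ (newly enabled gates) are notation introduced here for illustration, and the formulation in the appendix may differ.

```latex
% Assumed notation: V = set of gates, P = processed gates,
% C = independent current gates, B = maximum concurrent group size.
\[
  R(S) = \{\, v \in V \setminus (P \cup C) : \mathrm{fanin}(v) \subseteq P \cup S \,\}
\]
\[
  \max_{S \subseteq C,\; |S| \le B} \; \bigl|\, C \setminus S \,\bigr| + \bigl|\, R(S) \,\bigr|
\]
```

Under this reading, the succeeding independent gate group consists of the current gates left out of $S$ together with the gates that become current once $S$ is processed. For the example of Fig. 20, this gives 5 for the selection in Fig. 20(b) and 3 for the selection in Fig. 20(c).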
Algorithm 13: Concurrent Gate Group Selection
Since the max-succeeding-group problem is NP-complete, we propose a linear-time heuristic to solve it. This heuristic iteratively examines the prospective gates and puts a few independent current gates into the concurrent gate group. For each prospective gate, we check which of its fanin gates are independent current gates. The number of such fanin gates is called the ICG (independent current gate) fanin size. In each iteration, the prospective gate with the minimum ICG fanin size is selected, and all of its ICG fanin gates are put into the concurrent gate group. The selected prospective gate is no longer considered in subsequent iterations. At the same time, the selected ICG fanin gates are no longer counted in the ICG fanin size of the remaining prospective gates.
In the example of Fig. 20, the prospective gate with the minimum ICG fanin size is 𝐺9. When it is selected, gate 𝐺3 is put into the concurrent gate group. Then, the ICG fanin size of 𝐺7 becomes 1, which is the minimum. This requires that gate 𝐺1 be put into the concurrent gate group. Next, any two of 𝐺2, 𝐺4 and 𝐺5 can be selected to complete the concurrent gate group of size 4.
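The selection heuristic can be sketched in Python as follows. This is not the authors' Algorithm 13. It is a simple illustration that rescans for the minimum in every iteration, so it does not achieve the linear-time bound claimed for the original. The netlist `FIG20_FANINS` is a hypothetical reconstruction of Fig. 20. The function name, the tie-breaking rule (first gate in dictionary order) and the final step that fills leftover budget with arbitrary current gates are all assumptions.

```python
# Hypothetical fanins consistent with the description of Fig. 20 (assumption).
FIG20_FANINS = {
    "G1": [], "G2": [], "G3": [], "G4": [], "G5": [],
    "G6": ["G1", "G2"], "G7": ["G1", "G3"], "G8": ["G1", "G4"],
    "G9": ["G3"], "G10": ["G4", "G5"],
}


def select_concurrent_group(fanins, processed, max_group):
    """Greedy selection of one concurrent gate group (illustrative sketch)."""
    processed = set(processed)
    current = {g for g, fi in fanins.items()
               if g not in processed and all(f in processed for f in fi)}
    # Unselected ICG fanins of every prospective gate.
    icg_fanins = {}
    for g, fi in fanins.items():
        if g in processed or g in current:
            continue
        if all(f in processed or f in current for f in fi):
            icg_fanins[g] = {f for f in fi if f in current}

    group = []
    while icg_fanins and len(group) < max_group:
        # Prospective gate with the minimum ICG fanin size (ties: first found).
        gate = min(icg_fanins, key=lambda g: len(icg_fanins[g]))
        needed = icg_fanins.pop(gate)
        if len(group) + len(needed) > max_group:
            break  # assumption: stop when the cheapest gate exceeds the budget
        group.extend(sorted(needed))
        # Selected ICGs no longer count toward the other gates' fanin sizes.
        for remaining in icg_fanins.values():
            remaining -= needed

    # Assumption: spend any leftover budget on arbitrary unselected ICGs.
    for g in sorted(current - set(group)):
        if len(group) >= max_group:
            break
        group.append(g)
    return group


group = select_concurrent_group(FIG20_FANINS, processed=set(), max_group=4)
print(group)
```

With the assumed fanins, the first two picks are 𝐺3 and then 𝐺1, matching the walk-through above. Which two gates complete the group depends on the assumed netlist and on the tie-breaking rule.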
Here is the rationale behind the heuristic. The maximum allowed size of the concurrent gate group can be treated as a budget. The goal is to maximize the number of succeeding ICGs. If a prospective gate has a small ICG fanin size, selecting its ICG fanin gates can increase the number of succeeding ICGs with the minimum usage of the concurrent gate group budget.
This heuristic is performed once on the CPU. The result, which is the gate-level schedule, is saved, since the same schedule is employed repeatedly in the traversals of the JRR algorithm (see Section B). The pseudocode for the concurrent gate selection heuristic is given in Algorithm 13. The minimum ICG fanin size is updated each time an ICG fanin size is updated, so the computation time is dominated by the fanin size updates. If the maximum fanin size among all gates is 𝐹𝑖, the ICG fanin size of each gate can be updated at most 𝐹𝑖 times. Thus, the time complexity of this heuristic is 𝑂(∣𝑉∣𝐹𝑖), where 𝑉 denotes the set of nodes in the circuit.
[Figure: multiprocessor i, containing ALU 1, ALU 2, ..., ALU m (each with its own registers), an instruction unit, shared memory, a texture cache and a constant cache; global memory.]
Fig. 21. A multiprocessor with on-chip shared memory. Device memory is connected to the processors through on-chip caches.