
Chapter 5 A Parallel Algorithm for APN Scheduling

5.2 The Proposed Algorithm

5.2.1 Scheduling Serially

To explain the node migration process, we consider the special case that only a single processing element is used to execute the PBSA algorithm. We refer to the serialized algorithm simply as the BSA algorithm.

In the scheduling of a task graph for an arbitrary network of processors, the scheduling of communication messages must be carefully handled lest the schedule length increase. However, previous approaches have not tackled the message scheduling issue satisfactorily. Usually the message scheduling scheme has to be supplied to the scheduler as a routing table. The routing table is used when a processor is being considered for accommodating a node. From the routing table, the possible routes reaching this processor from the node’s parents have to be examined. The scheduling algorithm has to know whether each and every link in a route is occupied during the period of communication. Checking such routing information for every candidate processor inevitably results in high time-complexity. Moreover, the routing information is either statically hard-wired in the routing table (as in the BU algorithm [135]) or dynamically updated (as in the MH algorithm [50]). In the static approach, the routing information is inflexible and may be inappropriate, so that some messages compete for the same links while other links are idle. In the dynamic approach, the routing table needs to be updated frequently to keep the routing information accurate; thus, the scheduling algorithm may have high time-complexity. Furthermore, in either approach, the routing information is usually maintained for only a few common network topologies, which may not be useful for an arbitrary network.

To cope with the message routing and scheduling problem, we employ an incremental adaptive approach. The task graph is first serialized through the CPN-Dominant sequence (refer to Section 2.6.5.6 for a detailed discussion), which is then injected into a single processor, called the pivot processor, which is the one with the largest number of links. This process is called serial injection. The CPN-Dominant sequence is used because it can capture the relative importance of nodes for scheduling. The nodes are then incrementally “bubbled up” by migration to the adjacent processors in a breadth-first order. In the course of node migration, messages are also incrementally routed and scheduled to the most suitable time slots on an optimized route. This is because a node will not migrate if its start-time cannot be reduced by the migration or the destination processor for migration is not a valid choice as specified by the underlying routing scheme.

A candidate node for migration is a node that has its data arrival time (DAT)—defined as the time at which the last message from its parent nodes finishes delivery—earlier than its current start-time on the pivot processor. The goal of the migration is to schedule the node to an earlier time slot on one of the adjacent processors that allows the largest reduction of the start-time of the node. To determine the largest start-time reduction, we need to compute the DAT and the start-time of the node on each adjacent processor. Simply put, a node can be scheduled to a processor if the processor has an idle time slot that is later than the node’s DAT and is large enough to accommodate the node. The following rule formalizes the computation of the start-time of a node on a processor.

Rule I: A node $n_i$ can be scheduled to a processor P in which m nodes $\{n_{P_1}, n_{P_2}, \ldots, n_{P_m}\}$ have been scheduled if there exists some k such that

$$ST(n_{P_{k+1}}, P) - \max\{FT(n_{P_k}, P),\ DAT(n_i, P)\} \ge w(n_i)$$

where $k = 0, \ldots, m$, $ST(n_{P_{m+1}}, P) = \infty$, and $FT(n_{P_0}, P) = 0$. The start-time of $n_i$ on processor P is given by $\max\{FT(n_{P_l}, P),\ DAT(n_i, P)\}$ with l being the smallest k satisfying the above inequality.

Rule I states that to determine the start-time of a node on a processor, we have to examine the first idle time slot, if any, before the first node on that processor. We check whether the overlapping portion, if any, of the idle time slot and the time period in which the node can start execution is large enough to accommodate the node. If it is, the node starts at the earliest possible time within that slot; if not, we proceed to try the next idle time slot.

The DAT of a node on a processor is constrained by the finish-times of the parent nodes and the message arrival times. If the node under consideration and its parent node are scheduled to the same processor, the message arrival time of this parent node is simply its finish-time on the processor (intra-processor communication time is ignored). On the other hand, if the parent node is scheduled to another processor, the message arrival time depends on how the message is routed and scheduled on the links. To schedule a message on a link, we can view the link as a resource similar to a processor and search for a suitable idle time slot on the link to accommodate the message. Simply put, a message can be scheduled to a link if the link has an idle time slot that is later than the source node’s finish-time and is large enough to accommodate the message. Rule II, which is similar to Rule I, formalizes the scheduling of a message to a link.
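The idle-slot search behind Rule I, which the link case reuses with messages in place of nodes, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the thesis's implementation: the function name `earliest_start` and the representation of a processor's schedule as a sorted list of `(start, finish)` pairs are hypothetical choices made for the example.

```python
import math

def earliest_start(busy, ready_time, duration):
    """Earliest start time on a resource according to Rule I.

    busy       -- sorted, non-overlapping (start, finish) pairs already scheduled
                  (assumed representation, not from the source)
    ready_time -- DAT of the node on this processor
    duration   -- w(n_i), the execution cost of the node
    """
    prev_finish = 0.0  # FT(n_P0, P) = 0
    # The sentinel plays the role of ST(n_P(m+1), P) = infinity.
    for start, finish in list(busy) + [(math.inf, math.inf)]:
        candidate = max(prev_finish, ready_time)
        # Is the overlap of the idle slot and the ready period long enough?
        if start - candidate >= duration:
            return candidate
        prev_finish = finish
    raise AssertionError("unreachable for a finite duration")

# Two nodes occupy [0, 2) and [5, 7); a node with DAT 1 and cost 3
# fits exactly into the gap between them.
print(earliest_start([(0, 2), (5, 7)], ready_time=1, duration=3))
```

Because the last slot is unbounded, the search always succeeds; the loop index at which it returns corresponds to the smallest k in Rule I.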

Rule II: A message $e_x = (n_i, n_j)$ can be scheduled to a link L on which m messages $\{e_1, \ldots, e_m\}$ have been scheduled if there exists some k such that

$$MST(e_{k+1}, L) - \max\{MFT(e_k, L),\ FT(n_i, Proc(n_i))\} \ge c_{ij}$$

where $k = 0, \ldots, m$ and $MST(e_{m+1}, L) = \infty$, $MFT(e_0, L) = 0$. The start-time of $e_x$ on L, denoted by $MST(e_x, L)$, is given by $\max\{MFT(e_r, L),\ FT(n_i, Proc(n_i))\}$ with r being the smallest k satisfying the above inequality.

Rule II states that to determine the start-time of a message on a link, we have to examine the first idle time slot, if any, before the first message on that link. We check whether the overlapping portion, if any, of the idle time slot and the time period in which the message can start transmission is large enough to accommodate the message. If it is, the message starts at the earliest possible time within that slot; if not, we proceed to try the next idle time slot.

The DAT of the node on the processor is then simply the largest value of the message arrival times. The parent node that corresponds to this largest value of the message arrival time is called the Very Important Parent (VIP).

It should be noted that in Rule II we do not explicitly model the underlying message switching technique of the target processor network. Thus, we can insert an appropriate method for modeling whether the messages are wormhole routed or circuit-switched. In a wormhole-routed network, messages are broken up into fixed-size portions called flits. Thus, a message normally does not occupy the whole route simultaneously in the course of communication. By contrast, in a circuit-switched network, a message occupies all the links involved in its route for the whole time period of communication.

In our approach, messages are automatically routed in the migration process of nodes from the pivot processor to other processors. A node will only migrate from a processor to its neighbor in each step, but multi-hop messages are routed incrementally after successive iterations of node migrations. Thus, in our approach, there is no need to use a routing table.

The sequential BSA algorithm [111], [113] is outlined below. The procedure Build_processor_list() constructs a list of processors in a breadth-first order from the first pivot processor. The procedure Serial_injection() constructs the CPN-Dominant sequence of the nodes and injects all the nodes to the first pivot processor.
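Before the listing, the way Rule II feeds into the DAT and the VIP can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the source: every remote parent sits on a processor directly linked to the target (single hop), a link's schedule is a sorted list of `(start, finish)` pairs keyed by `(source, destination)`, and an entry node without parents has DAT 0. The names `message_start` and `dat_and_vip` are hypothetical.

```python
import math

def message_start(link_busy, source_finish, comm_cost):
    """Rule II on one link: earliest MST for a message of length c_ij."""
    prev_finish = 0.0  # MFT(e_0, L) = 0
    # The sentinel plays the role of MST(e_(m+1), L) = infinity.
    for start, finish in list(link_busy) + [(math.inf, math.inf)]:
        candidate = max(prev_finish, source_finish)
        if start - candidate >= comm_cost:
            return candidate
        prev_finish = finish

def dat_and_vip(parents, target_proc, links):
    """Return (DAT, VIP) of a node on target_proc.

    parents -- list of (name, proc, finish_time, comm_cost) tuples
    links   -- dict mapping (src_proc, dst_proc) to that link's busy intervals
    """
    dat, vip = 0.0, None  # assumption: a node with no parents has DAT 0
    for name, proc, finish, cost in parents:
        if proc == target_proc:
            arrival = finish  # intra-processor communication time is ignored
        else:
            start = message_start(links[(proc, target_proc)], finish, cost)
            arrival = start + cost
        if vip is None or arrival > dat:
            dat, vip = arrival, name  # the latest-arriving parent is the VIP
    return dat, vip

# Parent "a" is local; parent "b" must cross a link that is busy during [4, 6).
links = {("P1", "P0"): [(4, 6)]}
parents = [("a", "P0", 3, 2), ("b", "P1", 5, 2)]
print(dat_and_vip(parents, "P0", links))
```

The sketch only finds the slot; a full scheduler would also record the chosen interval on the link, and would model multi-hop routes and the switching technique as discussed above.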

The Bubble Scheduling and Allocation Algorithm:

(1) Load processor topology and input task graph
(2) Pivot_TPE ← the processor with the highest degree
(3) Build_processor_list(Pivot_TPE)
(4) Serial_injection(Pivot_TPE)
(5) while Processor_list_not_empty do
(6)     Pivot_TPE ← first processor of Processor_list
(7)     for each ni on Pivot_TPE do
(8)         if ST(ni, Pivot_TPE) > DAT(ni, Pivot_TPE) or Proc(VIP(ni)) ≠ Pivot_TPE then
(9)             Determine DAT and ST of ni on each adjacent processor PE’
(10)            if there exists a PE’ s.t. ST(ni, PE’) < ST(ni, Pivot_TPE) then
(11)                Make ni migrate from Pivot_TPE to PE’
(12)                Update start-times of nodes and messages
(13)            else if ST(ni, PE’) = ST(ni, Pivot_TPE) and Proc(VIP(ni)) = PE’ then
(14)                Make ni migrate from Pivot_TPE to PE’
(15)                Update start-times of nodes and messages
(16)            end if
(17)        end if
(18)    end for
(19) end while
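Two pieces of the listing lend themselves to a short sketch: the breadth-first processor list of steps (2) and (3), and the migration test of steps (8) to (15). The code below is a hedged illustration, not the authors' implementation. It assumes the topology is an adjacency dict, that ties in processor degree are broken by dict order, and that the ST values on the adjacent processors have already been computed. The names `build_processor_list` and `choose_migration` are hypothetical.

```python
from collections import deque

def build_processor_list(topology):
    """Breadth-first processor order starting from the highest-degree processor."""
    pivot = max(topology, key=lambda p: len(topology[p]))  # step (2)
    order, seen, queue = [], {pivot}, deque([pivot])
    while queue:
        proc = queue.popleft()
        order.append(proc)
        for neighbor in topology[proc]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

def choose_migration(pivot, st_pivot, dat_pivot, vip_proc, st_on_neighbor):
    """Return the adjacent processor to migrate to, or None (steps 8 to 15).

    st_on_neighbor -- dict mapping each adjacent processor PE' to ST(ni, PE')
    """
    # Step (8): only nodes that start later than their DAT, or whose VIP
    # lives elsewhere, are candidates for migration.
    if not (st_pivot > dat_pivot or vip_proc != pivot):
        return None
    if not st_on_neighbor:
        return None
    # Step (10): prefer the neighbor giving the largest start-time reduction.
    best = min(st_on_neighbor, key=st_on_neighbor.get)
    if st_on_neighbor[best] < st_pivot:
        return best
    # Step (13): with no gain, still move next to the VIP if it costs nothing.
    if st_on_neighbor.get(vip_proc) == st_pivot:
        return vip_proc
    return None

topology = {"P0": ["P1"], "P1": ["P0", "P2", "P3"], "P2": ["P1"], "P3": ["P1"]}
print(build_processor_list(topology))
print(choose_migration("P1", 10, 6, "P0", {"P0": 7, "P2": 8}))
```

Moving a node to its VIP's processor on a tie does not shorten the schedule immediately, but it removes a message from the links, which can free slots for later migrations.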

If the target processors are heterogeneous, the decision as to whether a migration should be taken is determined by the finish-times of the nodes rather than the start-times. This is because for heterogeneous processors, the execution time of a task varies for different processors; hence, even if a task can start at the same time for two distinct processors, its finish-times can be different. Moreover, the first pivot processor will be the one on which the CP length is the shortest in order to further minimize the finish-times of the CPNs by exploiting the heterogeneity of the processors.
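A small sketch shows why finish-times are the right criterion in the heterogeneous case. The function name `best_by_finish_time` and the per-processor execution-time table are assumptions made for the example; the source does not prescribe a data layout.

```python
def best_by_finish_time(start_times, exec_times):
    """Pick the processor with the earliest finish-time FT = ST + execution time.

    start_times -- dict mapping processor to the node's ST on it
    exec_times  -- dict mapping processor to the node's execution time on it
    """
    finish = {p: start_times[p] + exec_times[p] for p in start_times}
    return min(finish, key=finish.get), finish

# The node can start at time 5 on both processors, but P1 runs it twice as fast.
choice, finish = best_by_finish_time({"P0": 5, "P1": 5}, {"P0": 4, "P1": 2})
print(choice, finish)
```

Comparing start-times alone would see no difference between the two processors here, whereas the finish-times differ by two time units.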

The procedure Build_processor_list() takes $O(p^2)$ time because it involves a breadth-first traversal of the processor graph. The procedure Serial_injection() takes $O(e + v)$ time because the CPN-Dominant sequence can be constructed in $O(e + v)$ time. Thus, the dominant step is the while-loop from step (5) to step (19). In this loop, it takes $O(e)$ time to compute the ST and DAT values of the node on each adjacent processor. If migration is done, it also takes $O(e)$ time. Since there are $O(v)$ nodes on the Pivot_TPE and $O(p)$ adjacent processors, each iteration of the while-loop takes $O(pev)$ time. With $O(p)$ processors in the processor list, the BSA algorithm thus takes $O(p^2 e v)$ time. Notice that this time-complexity is comparable to those of other APN algorithms surveyed in Chapter 2.

The above incremental scheduling method can also be extended to heterogeneous processors in the following manner: the decision as to whether a migration should be taken is determined by the finish-times of the nodes rather than the start-times. Moreover, the first pivot processor will be the one for which the CP length is the shortest so as to further minimize the finish-times of the CPNs by exploiting the heterogeneity of the processors.
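As a worked check of the time-complexity discussed above, the per-step costs quoted in the text can be multiplied out. The sketch assumes that the while-loop runs once per processor in Processor_list, that is, $O(p)$ iterations; this count is inferred from steps (5) and (6) rather than stated explicitly in the text.

```latex
\begin{align*}
T_{\text{iteration}} &= \underbrace{O(v)}_{\text{nodes on Pivot\_TPE}}
  \cdot \underbrace{O(p)}_{\text{adjacent processors}}
  \cdot \underbrace{O(e)}_{\text{ST and DAT, or migration}} = O(pev), \\
T_{\text{loop}} &= \underbrace{O(p)}_{\text{while-loop iterations (assumed)}} \cdot O(pev) = O(p^2 e v), \\
T_{\text{BSA}} &= O(p^2) + O(e + v) + O(p^2 e v) = O(p^2 e v).
\end{align*}
```

The first two terms of the last line are the costs of Build_processor_list() and Serial_injection(); both are dominated by the loop.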

In document High-Performance Algorithms for Compile-Time Scheduling of Parallel Processors (Page 149-153)