Hardware/Software Co-design With the HMS Framework

Michael Sheliga

Edwin Hsing-Mean Sha

Dept. of Computer Science & Engineering University of Notre Dame

Notre Dame, IN 46556

ABSTRACT

Hardware/software co-design is an increasingly common design style for integrated circuits. It allows the majority of a system to be designed quickly with standardized parts, while special-purpose hardware is used for the time-critical portions of the system. The framework considered in this paper performs Hardware/Multi-Software (HMS) co-design for iterative loops, given an input specification that includes the system to be built, the number of available processors, the total chip area, and the required response time. Initially, all operations are done in software. The system then substitutes hardware (adder, multiplier, bus) for software based on the needability of each type of hardware unit. After a new hardware unit is introduced, the system is rescheduled using a variation of rotation scheduling in which operations may be moved between processors. Experimental results illustrate the efficiency of the algorithms as well as the savings achieved.

1 Hardware/Software Co-design Introduction

The design of computer systems that incorporate both standardized off-the-shelf processors, or software, and specialized hardware is referred to as hardware/software (hw/sw) co-design [9]. Since the complexity and functionality of computer systems are increasing at a dramatic rate, it is very difficult for custom systems to be designed, built, and tested within an acceptable time period, even with the most advanced computer-aided design tools, unless standardized parts are used. However, many systems also have time-critical parts which must be implemented in hardware. Hence hw/sw co-design is becoming an increasingly important design style. Hardware/software systems are able to take advantage of standardized processors which have been previously designed and tested to reduce design time and improve reliability. At the same time they use hardware to meet time and area constraints which could not be met by only using general-purpose processors.

This paper presents several algorithms that perform hw/sw co-design given chip area and timing constraints. Since the final design consists of one hardware partition and several software partitions, it is referred to as the hardware/multi-software, or HMS, system. The HMS system partitions the input specification into hardware and software while taking into account the number of inter-partition buses.

Scheduling is done in conjunction with partitioning. While this paper is mainly concerned with hardware minimization and partitioning, there are several additional factors that must be considered during hw/sw co-design. Scheduling, pin limitations and bus constraints all contribute to the complexity of the design process. Each of these is considered by our algorithms.

As an example, consider the optical wheel speed sensor system shown in Figure 1(A), which is to be implemented in 100 clock cycles using no more than 40 square units of chip area. As with most systems, this one could be implemented using standardized processors, specialized hardware, or various combinations of both. Figure 1(B) shows a design of the wheel speed sensor system that has been implemented solely in software. Note that while the system was designed in only two months, it does not meet the chip area constraints or the timing constraints. Hence, while this design was easiest and fastest to build and test, it is not acceptable. Figure 1(C) shows a second design that has been implemented solely in hardware. The system surpasses both the area and timing constraints by at least 40%. In fact, it is minimum in terms of the AREA × TIME product. However, the design cycle time has increased to nine months. In many applications, especially those in which products are sold competitively, such delays are becoming increasingly unacceptable. This is especially true as hardware advances continue to come at faster rates and the effective lifetimes of computer systems are reduced.

[Figure 1 panels. (A) Blocks: Input Decoding, FIR Filter, Tick to Speed Inversion, Output Encoding; system constraints: area 40 units, time 100 cycles. (B) Processors 1 to 5: area 48 units, time 132 cycles, design time 2 months. (C) Area 24 units, time 52 cycles, design time 9 months. (D) Processors 1 to 3 plus an ASIC: area 37 units, time 95 cycles, design time 3.5 months.]

Figure 1: A) Block diagram of an optical wheel speed sensor system. B) The system implemented in software. C) The system implemented in hardware. D) The system implemented using hw/sw co-design.

Figure 1(D) shows a third design of the system which is implemented in both hardware and software. While the design is not as efficient as the design in (C), and was ready to be used slightly later than the design in (B), it establishes a balance between the two extremes. The implementation in (D) allows the designer to market his product before the competition while also meeting the technical constraints. In addition to the above trade-offs, there is also an implicit trade-off between the amount of hardware used (and hence the chip area used) and the response time of the final system. Since specialized hardware adds significantly to the design and test cycle time, our algorithms emphasize keeping the hardware to a minimum, provided that time and area constraints are met.

Since hw/sw co-design is a new area, relatively little research has been done on actually synthesizing an entire design. Most research has focused on particular aspects of the design process such as creating appropriate abstractions and specifications of the problem [4, 3], hw/sw interfaces [7, 17], and performance estimation [5, 18]. Other research which has actually synthesized systems has done so for particular types of systems. For instance, [16] covers low power systems while [1] considers telecom systems.

A customized processor is automatically generated by Holmer and Prangle in [13]; however, traditional hw/sw co-design is not performed. Given an input problem, a processor instruction set, data path, and control paths are extracted and a programmable processor is produced. The processor generated is a "combination" of hardware and software. It is a processor geared toward the problem at hand, but which should be able to solve similar problems in an efficient manner.

COSYMA [12, 2] performs hw/sw co-design using a simulated annealing partitioning algorithm. As with the HMS system, all operations begin in software and are moved to hardware. However, COSYMA assumes a target architecture of one processor, one hardware component, one global bus and one global memory, while the HMS system allows a variable number of buses and software units. Furthermore, the HMS system performs scheduling along with partitioning.

Gupta and DeMicheli [10, 11] perform traditional hw/sw co-design for reactive systems which have inputs whose arrival times are unknown. Their system, VULCAN II, performs partitioning by examining the inputs for each operation. Operations whose inputs have unbounded delays are called nondeterministic operations, while all other operations are deterministic. If an unbounded delay is caused by waiting for an external input the operation is called an external nondeterministic operation.

All other nondeterministic operations are internal operations. Operations are then partitioned into hardware and software largely based upon which of the above classes they are in. VULCAN II performs static scheduling for groups of operations but cannot schedule all nodes since the delay times of some inputs are unknown. Hence it also uses dynamic scheduling.

While VULCAN II performs both partitioning and scheduling, there are several important differences between their system and ours. First, their algorithm begins with all operations in hardware and then moves operations to software. Operations may not be moved from software to hardware. In contrast, the HMS system begins with all operations in software and then adds hardware units. The HMS system then allows operations to be transferred from hardware to software as well as from software to hardware. A second important difference is that our system considers a varying number of software components and buses but only one hardware component, while VULCAN II presumes there is one system bus, one software component, and multiple hardware components.

Perhaps the most crucial difference is that the HMS system is targeted towards a different set of applications. VULCAN II is designed for reactive systems where the arrival times of some inputs are unknown. Therefore VULCAN II uses both dynamic scheduling and static scheduling while partitioning the system. On the other hand, the HMS system is meant for iterative loops such as DSP filters in which the arrival times of all inputs are known. Hence it is able to perform a variation of rotation scheduling that allows data to be transferred back and forth between partitions several times.

Our system designs single-chip integrated circuits that can be represented as data flow graphs. It begins with an all-software implementation and then adds hardware until the area and timing constraints are met. During each iteration of the HMS system three basic steps are performed. First we decide what hardware to add to the system based upon the needability of each type of hardware. The needability is a measure of how often an operation of this type cannot be scheduled due to resource constraints. Operations that are on the critical path are given extra weighting when calculating the needability.

The second step is to add the new hardware and transfer operations to it. We present two algorithms to do this. The first, delayed reallocation, only adds the new hardware, while the second, immediate reallocation, also transfers groups of operations to the new hardware. Groups of operations are chosen based upon how many timesteps may be saved by transferring them, as well as the timesteps saved for the successors of these nodes. The final step of each iteration compacts the schedule using variable partition rotation, a variation of rotation scheduling that allows nodes to be transferred between partitions. Nodes are transferred to the partition that leads to the best available timestep for all nodes.

In addition to these three steps the HMS system also verifies that the bus requirements are met for each rotation. The bus scheduling problem is a variation of the bin packing problem and a modified best-fit algorithm is used to solve it.

Section 2 introduces definitions and terminology used by our algorithms while Section 3 covers the assumptions of our system and introduces the main flow chart of the HMS system. Section 4 explains the HMS co-design algorithms in detail. Section 4.1 shows how the system calculates the needability of each type of unit while Section 4.1.2 explains how buses are scheduled and allocated. Section 4.2 presents the delayed reallocation and immediate reallocation algorithms. Included in this section is an explanation of variable partition rotation scheduling. Section 5 demonstrates the effectiveness of the algorithm for several input systems. Finally, Section 6 draws conclusions from the results obtained and summarizes our research.

2 Definitions and Terminology

Definition 1 A Data Flow Graph (DFG) is a node-weighted and edge-weighted directed graph $G = (OP, E, T, type, ti, de)$ where

$OP = \{o_i \mid 1 \le i \le n\}$ is the set of computation nodes, or operations

$E = \{e_l \mid 1 \le l \le |E|\}$, $E \subseteq OP \times OP$, is the set of directed edges which define the precedences from nodes in $OP$ to nodes in $OP$

$T = \{t_k \mid 1 \le k \le m\}$ is the set of operation types

$type(o_i)$ is a function from $OP$ to $T$ representing the type of operation $o_i$

$ti(t_k)$ is a function from $T$ to the positive integers representing the computation time of a node of type $t_k$

$de(e_l)$ is a function from $E$ to the nonnegative integers representing the number of delays on edge $e_l$

We assume $ti(t_k) = 1 \;\forall k$ for all software operations for the remainder of the paper. This unit of time is referred to as one timestep.

Definition 2 A Partitioned, Scheduled Data Flow Graph is a DFG $G_{ps} = (OP, E, T, P, type, ti, de, part, tsp)$ where each $o_i$ has been assigned to a partition and a timestep in which it starts to execute.

$P = \{p_j \mid 1 \le j \le |P|\}$ is the set of partitions

$part(o_i)$ is a function from $OP$ to $P$ representing the partition in which operation $o_i$ is located

$tsp(o_i)$ is a function from $OP$ to the positive integers representing the timestep in which operation $o_i$ is to begin execution

$ti(e_l)$ is a function from $E$ to the nonnegative integers, representing the time it takes to transfer data using edge $e_l$

All other definitions are the same as for an unpartitioned, unscheduled DFG. We use the notation $e_{ab}$ to denote an edge which begins at node $a$ and ends at node $b$. Node $a$ is said to be the predecessor of node $b$ while node $b$ is said to be the successor of node $a$. We assume $ti(e_{ab}) = 0$ for all edges where $part(a) = part(b)$, and that $ti(e_{ab})$ takes a single constant value, the same for all edges, where $part(a) \neq part(b)$, unless noted otherwise. We also note that each standardized processor used in the design will correspond to one partition, $p_j$, while the additional specialized hardware will also correspond to a single partition.

Definition 3 A Retiming of a DFG is a function from $OP$ to the set of integers. $re(o_i)$ represents the number of delays moved from each incoming edge of operation $o_i$ to each outgoing edge of operation $o_i$ during a retiming. If $G_r$ is a retimed version of data flow graph $G$, then $de_r(e_{uv}) = de(e_{uv}) + re(u) - re(v)$ for edge $e_{uv}$. A retiming $r$ is legal if $de_r(e) \ge 0 \;\forall e$. Intuitively, this means that edges may not have a negative number of delays. Also note that the number of delays in any loop of a DFG must be greater than zero [14]. This property may not be altered by a legal retiming.
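The retiming rule in Definition 3 can be illustrated with a minimal Python sketch. The three-node loop, the node names, and the helper names `retime` and `is_legal` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (u, v): an edge from node u to node v


def retime(de: Dict[Edge, int], re: Dict[str, int]) -> Dict[Edge, int]:
    """Return de_r, where de_r(e_uv) = de(e_uv) + re(u) - re(v)."""
    return {(u, v): d + re.get(u, 0) - re.get(v, 0) for (u, v), d in de.items()}


def is_legal(de_r: Dict[Edge, int]) -> bool:
    """A retiming is legal when no edge ends up with a negative delay count."""
    return all(d >= 0 for d in de_r.values())


# Hypothetical loop a -> b -> c -> a with one delay on the back edge (c, a).
delays = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}

# re(a) = 1: the incoming edge (c, a) loses a delay, the outgoing edge (a, b) gains one.
rotated = retime(delays, {"a": 1})

# re(b) = 1 on the original graph would take a delay from edge (a, b), which has none.
illegal = retime(delays, {"b": 1})
```

In this sketch the total number of delays around the loop is unchanged by the legal retiming, which matches the remark that a legal retiming cannot alter that property.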

3 Main Idea

With any hardware/software co-design system it is important to remember what assumptions are made as well as what limitations are placed on the system. We briefly present these as well as the main flow chart of the system in the first two sections. We then briefly discuss how we decide what hardware to add to the system as well as how we reschedule the DFG once extra hardware has been added.

3.1 Assumptions

Our system assumes that a DFG is given and that it is to be scheduled in a given number of timesteps. A limit on the amount of area that may be taken up by the final design is also given. The hardware used to build the system is of two types. Standardized, or off-the-shelf, processors may be used to construct the system. Systems constructed in this manner require the least design time. They also require software simulation, test and verification, as opposed to hardware simulation, fabrication and verification, which are assumed to be more time consuming. However, standardized processors tend to be slower and take up more chip area than specialized hardware. Hence, our system attempts to establish a trade-off between design time on one side and chip area and speed on the other.

For the remainder of the paper we assume that the standardized hardware consists of one type of processor, while the specialized hardware consists of adders and multipliers. These assumptions simplify the explanations of how our system works as well as its implementation; however, they may easily be modified. The chip area taken up by adders, multipliers, standardized processors, and global buses is also given. The area taken up by intra-partition buses is assumed to be negligible.

It is assumed that hardware runs at a faster rate than the software and that this rate is fixed for all operation types. Hence if hardware multipliers are 50% faster than software multipliers, then hardware adders are also 50% faster than software adders. While hardware may be any percentage faster than software in our system, numbers that result in low integral fractions when comparing software cycles to hardware cycles are used to simplify the examples presented. In cases where there may be confusion between software timesteps and hardware timesteps, "timesteps" refers to software timesteps, while "hardware timesteps" refers to hardware timesteps. In the case of time delays for inter-partition data transfers, all data takes one software timestep, not one hardware timestep.

3.2 The Main Algorithm of the HMS System

Figure 2 presents the main algorithm of the HMS system. The details of each subroutine are explained in Section 4. We begin by attempting to implement the entire system using standardized processors. We use as many processors as the chip area allows, and schedule the system using list scheduling. If the system cannot be scheduled in the required amount of time, the algorithm begins to add specialized hardware. The subroutine Most_Needed_Hardware is used to determine what type of hardware is to be added. If adding a unit of hardware violates the chip area constraint, a standardized processor is eliminated and replaced by an equivalent amount of specialized hardware. The additional specialized hardware is equivalent to the standardized processor in that both may perform the same number of operations in a given time if both are fully utilized. For example, if the standardized processor being eliminated from the system contained six adders and two multipliers, and hardware was twice as fast as software, then three hardware adders and one hardware multiplier would be added to the system.
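The processor-replacement rule in the example above can be written as a short Python sketch. The function name `equivalent_hardware` and the decision to round up when the division is not exact are assumptions made here; the paper only gives a case that divides evenly.

```python
import math
from typing import Dict


def equivalent_hardware(processor_units: Dict[str, int], speedup: float) -> Dict[str, int]:
    """Hardware units matching a processor's throughput when hardware is `speedup` times faster."""
    # Rounding up is an assumption; the paper's example divides evenly.
    return {kind: math.ceil(count / speedup) for kind, count in processor_units.items()}


# The paper's example: six adders and two multipliers, hardware twice as fast as software.
replacement = equivalent_hardware({"adder": 6, "multiplier": 2}, speedup=2)
```

With these inputs the sketch yields three hardware adders and one hardware multiplier, the same counts as in the text.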

We then use either delayed reallocation or immediate reallocation to reschedule the data flow graph. Both of these use variable partition rotation, a variation of rotation scheduling in which operations may be transferred between partitions. We continue this process until the time constraint is met. The algorithm used was chosen since it permits the maximum amount of off-the-shelf hardware to be used, thereby reducing design time.

ALGORITHM MAIN_HMS(DFG, Time_Desired, Total_Chip_Area)

Input : a DFG
        Time_Desired: the maximum time
        Total_Chip_Area: the maximum design area

Output : a scheduled hardware/software design

begin
    while (Chip_Area_Used < Total_Chip_Area)
        Add_Standard_Processor( );
    end
    Time_Required ← List_Schedule( );
    while (Time_Required > Time_Desired)
        New_Unit_Type ← Most_Needed_Hardware( );
        Add_New_Unit(New_Unit_Type);
        if (Chip_Area_Used > Total_Chip_Area) then
            Remove_Standard_Processor( );
            Add_Equivalent_Hardware( );
        end
        if (Algorithm == Delayed_Reallocation) then
            Time_Required ← Variable_Partition_Rotation( );
        else if (Algorithm == Immediate_Reallocation) then
            Move_Operations_to_New_Hardware( );
            Obtain_Legal_Schedule( );
            Time_Required ← Variable_Partition_Rotation( );
        end
    end
end

Figure 2: The HMS algorithm.
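The control flow of Figure 2 can be sketched in Python as follows. This is only a skeleton under stated assumptions: every callable passed in (`add_most_needed_unit`, `area_exceeded`, `swap_processor_for_hardware`, `reschedule`) is a hypothetical stand-in for the paper's subroutines, the `max_rounds` guard is added here for safety, and the toy model at the bottom (each unit saves 10 timesteps) is invented purely to exercise the loop. The loop continues while the schedule is longer than the desired time, in line with the statement that the process continues until the time constraint is met.

```python
from typing import Callable


def hms_main(
    time_desired: int,
    initial_time: int,
    add_most_needed_unit: Callable[[], None],
    area_exceeded: Callable[[], bool],
    swap_processor_for_hardware: Callable[[], None],
    reschedule: Callable[[], int],
    max_rounds: int = 1000,
) -> int:
    """Add hardware until the schedule length meets time_desired; return the final length."""
    time_required = initial_time  # length of the all-software list schedule
    rounds = 0
    while time_required > time_desired and rounds < max_rounds:
        add_most_needed_unit()             # Most_Needed_Hardware + Add_New_Unit
        if area_exceeded():                # chip area constraint violated
            swap_processor_for_hardware()  # Remove_Standard_Processor + Add_Equivalent_Hardware
        time_required = reschedule()       # variable partition rotation
        rounds += 1
    return time_required


# Toy stand-ins (hypothetical): each added unit shortens the schedule by 10 timesteps.
state = {"time": 132, "units": 0}


def add_unit() -> None:
    state["units"] += 1


def toy_reschedule() -> int:
    state["time"] -= 10
    return state["time"]


final_time = hms_main(100, state["time"], add_unit, lambda: False, lambda: None, toy_reschedule)
```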


3.3 Hardware Addition

During the design process new units of hardware (adders, multipliers, buses, etc.) are added to the system. We must determine what type of hardware to add. In order to do this the needability of each type of hardware is determined. The needability of a type of hardware is a measure of how often the system would like to schedule an operation of this type, but cannot due to resource constraints. The type of hardware with the greatest needability is added to the system.

3.4 Rescheduling

When new hardware units are added to the system, rescheduling is done to see if the system's time constraint can be met. Two algorithms are used to reschedule the system after a new unit of hardware is introduced. Both use a variation of rotation scheduling [6] in which operations may be moved between processors. The first algorithm, called delayed reallocation, only allows operations to be transferred during rotation scheduling.

Transferring operations in groups may be advantageous, especially when the new hardware is faster than the existing hardware. The second rescheduling algorithm, immediate reallocation, transfers operations to the new unit of hardware as soon as it is inserted. Since the new hardware unit has no operations scheduled for any timestep, operations may easily be transferred in groups. As many operations as possible are transferred to the new hardware unit. After operations have been transferred to the new hardware unit unallocated resources will exist in the software, therefore delayed reallocation must still be done in order to maximize the system throughput.

4 Description of Algorithms

4.1 Hardware Addition

4.1.1 Functional Unit Addition

When the desired time constraint cannot be met using the current hardware we must use additional hardware. In order to determine what type of hardware to add, our algorithm considers the needability, $N_{t_k}$, of each type of hardware (adder, multiplier, bus, etc.). Intuitively, the needability is an estimate of how often this type of unit is needed but unavailable in the current schedule.

In order to calculate the needability of each type of hardware, we define partially scheduled DFGs, hw constrained operations and hw constrained critical path operations. Given a scheduled DFG, a partially scheduled DFG at timestep $ts$ is a DFG where operations for which $tsp(o_i) < ts$ are scheduled at $tsp(o_i)$ while other operations are not assigned a timestep. An operation $o_i$ is a hw constrained operation at timestep $ts$ with respect to a partially scheduled DFG if and only if operation $o_i$ could be scheduled at timestep $ts$ but is not scheduled there in the scheduled DFG due to hardware constraints. Similarly, operation $o_i$ is a hw constrained critical path operation at timestep $ts$ if it is a hw constrained operation and the longest path originating at it and terminating at an output node is maximum for all nodes that have not been scheduled in the partially scheduled DFG at timestep $ts$. Note that these definitions only depend on the current schedule, not the scheduling algorithm used.

As an example consider Figure 3. Figure 3 (A) shows an example DFG which has not been scheduled, while Figure 3 (B) shows the same DFG after it has been scheduled presuming that 2 adders and 1 multiplier are available. Node 1 could be scheduled in timestep 1 if an additional adder was available. Therefore node 1 is a hw constrained operation. Similarly node 2 could be scheduled in timestep 2 if an additional multiplier was available. Node 2 is also on the critical path at timestep 2, therefore it is a hw constrained critical path operation. By not scheduling node 2 in timestep 2, the length of the schedule is guaranteed to increase.

Notice that node 3 is not a hw constrained node, even though it could be scheduled in timestep 3 if unlimited resources were available during scheduling. This is because not all of node 3's predecessors (node 2 in this case) have finished executing in Figure 3(B) until the end of timestep 3. A similar analysis holds for hw constrained critical path nodes. They are only hw constrained critical path nodes given that they are part of the longest path of unscheduled nodes at this timestep.

The needability of operation type $t_k$ is defined as $N_{t_k} = W_{t_k} \sum_{ts=1}^{timesteps} \left( W_{hw}\, CHW_{t_k,ts} + W_{cp}\, CCP_{t_k,ts} \right)$, where $W_{t_k}$ is a weighting factor related to the chip area that a unit of type $t_k$ takes up, $W_{hw}$ is a weighting factor for hw constrained operations, and $W_{cp}$ is a weighting factor for hw constrained operations that are on the critical path. $CHW_{t_k,ts}$ and $CCP_{t_k,ts}$ are simply the number of hw constrained and hw constrained critical path operations of type $t_k$ at timestep $ts$. The needability of buses is defined similarly and is explained further in Section 4.1.2.

[Figure 3 panels: (A) a DFG with labeled nodes 1, 2 and 3 and a legend distinguishing multiplier and adder nodes; (B) its schedule over timesteps 1 to 4.]

Figure 3: A) A sample DFG. B) The scheduled DFG presuming two adders and one multiplier.

As an example let us assume $W_{adder} = 7$, $W_{mult} = 5$, $W_{hw} = 1$, and $W_{cp} = 1$ in Figure 3(B). $W_{adder}$ is greater than $W_{mult}$ to reflect the fact that multipliers take up more chip area than adders. Hence we are less hesitant about increasing the number of adders. In Figure 3, $N_{adder} = W_{adder} \sum_{ts=1}^{4} \left( W_{hw}\, CHW_{adder,ts} + W_{cp}\, CCP_{adder,ts} \right)$. Noting that $CHW_{adder,ts} = 0$ for $ts \neq 1$, $CHW_{adder,1} = 1$, and $CCP_{adder,ts} = 0 \;\forall ts$, we have $N_{adder} = W_{adder}(W_{hw}\, CHW_{adder,1}) = 7(1 \cdot 1) = 7$.

Similarly $CHW_{mult,ts} = 0$ for $ts \neq 2$, $CHW_{mult,2} = 1$, $CCP_{mult,ts} = 0$ for $ts \neq 2$, and $CCP_{mult,2} = 1$. Hence $N_{mult} = W_{mult}\left( (W_{hw}\, CHW_{mult,2}) + (W_{cp}\, CCP_{mult,2}) \right) = 5\left( (1 \cdot 1) + (1 \cdot 1) \right) = 10$. The larger value of $N_{mult}$ compared to $N_{adder}$ indicates that it would be more helpful to increase the number of multipliers in the system than the number of adders. Notice that increasing the number of multipliers enables the system to be scheduled in 3 timesteps; however, no increase in the number of adders will be helpful for this system.
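The needability calculation for the Figure 3 example can be reproduced with a few lines of Python. The function name `needability` and the list-based encoding of the per-timestep counts are assumptions for illustration; the weights and counts are the ones given in the example.

```python
from typing import Sequence


def needability(w_type: int, chw: Sequence[int], ccp: Sequence[int],
                w_hw: int = 1, w_cp: int = 1) -> int:
    """N = W_type * sum over timesteps of (W_hw * CHW[ts] + W_cp * CCP[ts])."""
    return w_type * sum(w_hw * h + w_cp * c for h, c in zip(chw, ccp))


# Counts for timesteps 1..4 as given in the Figure 3 example.
n_adder = needability(7, chw=[1, 0, 0, 0], ccp=[0, 0, 0, 0])
n_mult = needability(5, chw=[0, 1, 0, 0], ccp=[0, 1, 0, 0])

# The unit type with the greatest needability is the one to add.
scores = {"adder": n_adder, "multiplier": n_mult}
most_needed = max(scores, key=lambda kind: scores[kind])
```

The sketch selects the multiplier, in agreement with the conclusion drawn in the text.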

4.1.2 Bus Addition

Before we begin to discuss the bus scheduling algorithm used as part of the bus needability algorithm we must discuss the data transfer model used. As discussed in [15] there are many different ways to model data transfers. First, we must decide how long data transfers take. As noted in Section 2 we assume that all inter-partition data transfers take one time unit and all intra-partition data transfers are done between clock cycles. A second factor to consider is when data is transferred. One model of data transfers assumes that all data is transferred as soon as it is calculated. While this model greatly simplifies the calculation of bus statistics, it is inefficient since extra buses may be needed for data transfers that could be delayed. A second model of data transfer assumes that data may be transferred at any time between when it is generated and when it is used. We refer to this as the flexible transfer model. While this model is more realistic, it can be shown that optimizing data transfers for it is NP-hard [8]. We use the flexible transfer model for the remainder of the paper.

The needability of buses is similar to that of hardware units. For each timestep we calculate the number of nodes that cannot be scheduled at an earlier timestep due to bus constraints. We weight the nodes that are on the critical path, and sum a weighted count of these numbers over all timesteps. Hence $N_{bus} = W_{bus} \sum_{ts=1}^{timesteps} \left( W_{hw}\, CHW_{bus,ts} + W_{cp}\, CCP_{bus,ts} \right)$. If $N_{bus}$ is greater than $N_{t_k}$ for all $k$ operation types, then an additional bus is inserted instead of additional hardware.

However, it is much more difficult to determine when a node may not be scheduled due to bus constraints than it is for hardware constraints. This is due to the fact that bus transfers at the current timestep depend upon bus transfers from other timesteps. Hence, a proposed schedule for the entire graph is used as input to the bus scheduling algorithm. The bus scheduling algorithm then generates a schedule for the data transfers, or a result that indicates such a schedule was not found. The main flow of this algorithm is shown in Figure 4 and is applied when rotating a node (to check that enough buses are available for the resulting schedule) as well as for the calculation of $N_{bus}$. It should be noted that this algorithm is designed for systems with data transfers of variable length, not only data transfers that take one timestep, as we have assumed in the rest of the paper.

The spacing of a data transfer is a measure of the number of timesteps between it and the other data transfers on the bus. It is desirable to schedule each data transfer so that it has zero spacing. Leaving a single unused timestep for a bus makes it difficult for the bus to be used in this timestep. Leaving two or more consecutive timesteps makes it more difficult for the bus to be used efficiently as compared to no timesteps but less difficult as compared to a single timestep. We define $S1_{TR,BUS,TS}$ as the number of timesteps between the beginning of data transfer $TR$ and the end of the last data transfer on bus $BUS$ if the data transfer is begun at timestep $TS$. $S2_{TR,BUS,TS}$ is defined as the time between the end of data transfer $TR$ and the start of the next data transfer on bus $BUS$ if the data transfer is begun at timestep $TS$. We then combine $S1$ and $S2$ using the formula $S = X(S1)\,X(S2) + X(S1) + X(S2)$, where $X(S1) = 0$ if $S1 = 0$ and $X(S1) = \frac{1}{S1}$ otherwise. $X(S2)$ is defined similarly. $S_{TR}$ is the smallest $S_{TR,BUS,TS}$ for all bus/timestep combinations for the data transfer under consideration. The data transfer (or group of data transfers in case of a tie) with the smallest $S_{TR}$ is then chosen.
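The spacing formula is easy to check numerically. The following Python sketch uses exact fractions; the function names `x` and `spacing` are chosen here for illustration only.

```python
from fractions import Fraction


def x(s: int) -> Fraction:
    """X(s) = 0 when s == 0, otherwise 1/s."""
    return Fraction(0) if s == 0 else Fraction(1, s)


def spacing(s1: int, s2: int) -> Fraction:
    """S = X(S1) * X(S2) + X(S1) + X(S2)."""
    return x(s1) * x(s2) + x(s1) + x(s2)


# A single idle timestep on one side scores worse than two idle timesteps,
# and a perfectly packed transfer scores zero.
one_gap = spacing(1, 0)
two_gap = spacing(2, 0)
packed = spacing(0, 0)
```

Because $X$ is the reciprocal of the gap, a gap of one timestep is penalized most heavily, which reflects the observation that a single unused timestep is the hardest to fill.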

ALGORITHM BUS_SCHEDULING(DFG)

Input : a proposed node schedule with a set of data transfers
        the number of buses

Output : a bus schedule for the data transfers, or an indication that no schedule was found

begin
    L ← All_Data_Transfers
    while L ≠ NULL
        Assign_Forced( );
        /* Choose Transfers with Best Spacing */
        for each transfer TR in L
            S_TR ← Spacing(TR);
        end
        L' ← Transfers_With_Best_Spacing(L);
        /* Choose Transfer with Worst Flexibility */
        for each transfer TR in L'
            F_TR ← Flexibility(TR);
        end
        L'' ← Transfers_With_Worst_Flexibility(L');
        /* Choose Transfer with Longest Length */
        for each transfer TR in L''
            LE_TR ← Length(TR);
        end
        Transfer_Chosen ← Longest_Transfer(L'');
        Assign_Transfer(Transfer_Chosen);
        L ← L − Transfer_Chosen
    end
end

Figure 4: The bus scheduling algorithm.

The length of a data transfer is the number of timesteps the data transfer takes (all data transfers are assumed to take integral length). The flexibility of a data transfer is the number of timesteps in which a data transfer can be scheduled divided by its length. Data transfers with low flexibility have fewer timesteps to be scheduled in, so we schedule these transfers first. Hence, among data transfers with the same spacing, transfers with the least flexibility are chosen first. Likewise, among transfers with the same spacing and flexibility, those with the largest length are scheduled first. Large data transfers are the hardest to "fit" into an existing schedule while small data transfers may more easily be scheduled around existing data transfers.
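The three-level selection rule (best spacing, then least flexibility, then longest length) can be sketched in Python as below. The function name `choose_transfer` and the dictionary encoding are assumptions. The numbers are read from the paper's worked example with transfers A through I; the flexibilities 5/2 for B and 3/2 for D are taken as the number of usable timesteps divided by a length of 2, which is an interpretation of the printed values.

```python
from fractions import Fraction
from typing import Dict, List


def choose_transfer(spacing: Dict[str, Fraction],
                    flexibility: Dict[str, Fraction],
                    length: Dict[str, int]) -> str:
    """Pick the next transfer: best spacing, then least flexibility, then longest length."""
    best_spacing = min(spacing.values())
    l1: List[str] = [t for t in spacing if spacing[t] == best_spacing]
    least_flex = min(flexibility[t] for t in l1)
    l2 = [t for t in l1 if flexibility[t] == least_flex]
    return max(l2, key=lambda t: length[t])


length = {"A": 1, "B": 2, "D": 2, "E": 1, "G": 3, "H": 1, "I": 2}
flexibility = {"A": Fraction(2), "B": Fraction(5, 2), "D": Fraction(3, 2), "E": Fraction(5)}

# First iteration: A, B, D and E all have spacing 0.
spacing_1 = {"A": Fraction(0), "B": Fraction(0), "D": Fraction(0), "E": Fraction(0),
             "G": Fraction(1, 2), "H": Fraction(1), "I": Fraction(1, 3)}
first = choose_transfer(spacing_1, flexibility, length)

# Second iteration: D is assigned and the spacing of B has risen to 1/2.
spacing_2 = {"A": Fraction(0), "B": Fraction(1, 2), "E": Fraction(0),
             "G": Fraction(1, 2), "H": Fraction(1), "I": Fraction(1, 3)}
second = choose_transfer(spacing_2, flexibility, length)
```

Flexibility is only looked up for transfers that survive the spacing filter, so the sketch does not need values for G, H and I.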

As a simple example consider the data transfers in Figure 5(A). Data transfer $A$ takes one timestep and is to be transferred between the beginning of timestep 4 and the end of timestep 5. Its length is 1 while its flexibility is $\frac{2}{1}$. Presuming no data transfers have been assigned buses and noting that all data transfers must be completed by the end of timestep five, if transfer $A$ is scheduled in timestep 4 using bus 1, $S1_{A,1,4} = 3$ and $S2_{A,1,4} = 1$. This reflects the fact that bus 1 has three unused timesteps between the transfer and the end of the previous data transfer on this bus (or the first possible transfer time, which is one in this case), and one timestep between its end and the next data transfer on this bus (or the last possible time at which data transfers may end in this case). Similarly $S1_{A,1,5} = 4$ and $S2_{A,1,5} = 0$ if transfer $A$ is scheduled in timestep 5 using bus 1. Data transfer $B$ takes 2 timesteps and may be transferred any time between the end of timestep 1 and the end of timestep 5. Its length is 2 while its flexibility is $\frac{5}{2}$. $S1_{B,1,1} = 0$, $S2_{B,1,1} = 3$, $S1_{B,1,2} = 1$, $S2_{B,1,2} = 2$, $S1_{B,1,3} = 2$, $S2_{B,1,3} = 1$, $S1_{B,1,4} = 3$, and $S2_{B,1,4} = 0$. The other data transfers are defined similarly.

We wish to schedule transfers A through I on buses 1, 2 and 3. The algorithm begins by assigning those data transfers which are forced to be scheduled in certain steps. This is done in subroutine Assign_Forced in Figure 4. For example, data transfer $C$ must begin at the end of timestep 1 while data transfer $F$ must begin at the end of timestep 3. Transfers $C$ and $F$ are shown in Figure 5(B) using solid circles. Subroutine Assign_Forced assigns buses to data transfers using the same spacing concept that is presented below.

Next we calculate the spacing, $S$, for each data transfer. $S^1_{A,1,4} = 2$, $S^2_{A,1,4} = 1$ and $S_{A,1,4} = X(2)X(1) + X(2) + X(1) = (\frac{1}{2})(1) + \frac{1}{2} + 1 = 2$. Similarly $S^1_{A,1,5} = 3$, $S^2_{A,1,5} = 0$ and $S_{A,1,5} = X(0)X(3) + X(0) + X(3) = (0)(\frac{1}{3}) + 0 + \frac{1}{3} = \frac{1}{3}$, while $S^1_{A,2,5} = 0$, $S^2_{A,2,5} = 0$, $S_{A,2,5} = 0$, $S^1_{A,3,4} = 3$, $S^2_{A,3,4} = 1$, $S_{A,3,4} = 1\frac{2}{3}$, $S^1_{A,3,5} = 4$, $S^2_{A,3,5} = 0$, $S_{A,3,5} = \frac{1}{4}$, and $S_A = 0$. These results show that placing data transfer A on bus 2 in timestep 5 results in the best spacing of zero. Similarly $S_B = 0$, $S_D = 0$, $S_E = 0$, $S_G = \frac{1}{2}$, $S_H = 1$, and $S_I = \frac{1}{3}$. Therefore data transfers A, B, D and E have the best spacings, all of which are 0. Hence $L' = \{A, B, D, E\}$. Next the flexibilities of A, B, D and E are calculated: $F_A = 2$, $F_B = \frac{5}{2}$, $F_D = \frac{3}{2}$, and $F_E = 5$. Since $F_D$ is the smallest, D is chosen and is

Figure 5: A) The Data Transfers (lengths: A = 1, B = 2, C = 1, D = 2, E = 1, F = 2, G = 3, H = 1, I = 2) B) Buses Assigned After AssignForced() is Called C) The Assignments After One Iteration D) The Final Assignments

scheduled starting at timestep 1 on bus 2, which was the timestep/bus combination that resulted in a spacing of 0. The system after transfer D has been assigned is shown in Figure 5(C). In the next iteration the spacing of data transfer B, $S_B$, has increased to $\frac{1}{2}$, leaving $S_A = S_E = 0$ as the minimum spacing. Since $F_A$ is less than $F_E$, transfer A is scheduled next. Similarly transfers I, G and B are assigned, at which point transfers H and then E are set via AssignForced. The system with all data transfers allocated is shown in Figure 5(D).
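The selection rule used in this example can be sketched as follows. The sketch assumes $X(0) = 0$ and $X(n) = \frac{1}{n}$ otherwise, a form inferred only from the worked values above ($X(1) = 1$, $X(2) = \frac{1}{2}$, $X(3) = \frac{1}{3}$); the function names are illustrative, and the per-transfer best spacings are taken as given rather than recomputed from a bus model.

```python
from fractions import Fraction


def x(n):
    # Assumed form of X, inferred from the worked values in the example.
    return Fraction(0) if n == 0 else Fraction(1, n)


def spacing(s1, s2):
    # Combined spacing from the gap before (s1) and the gap after (s2).
    return x(s1) * x(s2) + x(s1) + x(s2)


def choose_transfer(best_spacing, flexibility, length):
    # 1) keep the transfers with the smallest spacing (the list L'),
    # 2) among those prefer the least flexibility,
    # 3) among those prefer the largest length.
    smallest = min(best_spacing.values())
    candidates = [t for t, s in best_spacing.items() if s == smallest]
    return min(candidates, key=lambda t: (flexibility[t], -length[t]))


# Values from the first iteration of the example.
best_spacing = {"A": Fraction(0), "B": Fraction(0), "D": Fraction(0),
                "E": Fraction(0), "G": Fraction(1, 2), "H": Fraction(1),
                "I": Fraction(1, 3)}
flexibility = {"A": Fraction(2), "B": Fraction(5, 2),
               "D": Fraction(3, 2), "E": Fraction(5)}
length = {"A": 1, "B": 2, "D": 2, "E": 1}

print(spacing(2, 1))                                       # 2
print(choose_transfer(best_spacing, flexibility, length))  # D
```

Exact fractions are used so that ties between spacings are detected reliably, which floating-point values would not guarantee.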

4.2 Scheduling

When new hardware units are added to the system, rescheduling is done to see if the system's constraints can be met. Two algorithms are used to reschedule the system. The first, delayed reallocation, only allows operations to be moved between processors during rescheduling. The second, immediate reallocation, allows operations to be transferred between partitions as soon as the new unit of hardware is added as well as during rescheduling.

4.2.1 Delayed Reallocation

Delayed reallocation uses variable partition rotation, a variation of rotation scheduling, to reschedule the graph after new hardware is added. Rotation scheduling, which was developed by Chao, LaPaugh and Sha in [6], consists of retiming a scheduled DFG in order to obtain a DFG with a shorter schedule.

Figure 6 (A) shows an example DFG while Figure 6 (B) shows a possible initial schedule for the DFG.

In (A) the short thick lines on edges indicate delays. We assume that data transfers between partitions take 1 timestep. During rotation scheduling nodes are rotated down and then pushed up to a new position. Rotating down a node corresponds to retiming the original DFG by moving one delay from all input edges of the node to all output edges of the node. Since rotating a node or group of nodes is equivalent to retiming the original DFG, all edges terminating at a node which is being rotated must contain at least one delay. As an example, rotation of nodes A and B is equivalent to pushing the delays on edges terminating at nodes A and B ($e_{DA}$, $e_{FB}$) to the edges originating at nodes A and B ($e_{AC}$, $e_{BD}$).
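The retiming step behind a down rotation can be sketched as below. The sketch assumes the DFG is held as a dictionary from edges (source, target) to delay counts and includes only the four edges named in the text, with assumed initial delays; it is not the full graph of Figure 6.

```python
def rotate_down(delays, nodes):
    """Retime a DFG by rotating a set of nodes down once.

    delays -- dict mapping (source, target) edges to their delay count
    nodes  -- the nodes being rotated
    One delay is removed from every edge entering the set and added to
    every edge leaving it; edges inside the set are left unchanged.
    """
    nodes = set(nodes)
    incoming = [e for e in delays if e[1] in nodes and e[0] not in nodes]
    outgoing = [e for e in delays if e[0] in nodes and e[1] not in nodes]

    # Rotation is only legal if every incoming edge carries a delay.
    if any(delays[e] < 1 for e in incoming):
        raise ValueError("an edge entering the rotated nodes has no delay")

    retimed = dict(delays)
    for e in incoming:
        retimed[e] -= 1
    for e in outgoing:
        retimed[e] += 1
    return retimed


# Only the edges named in the text; the initial delay values are assumptions.
dfg = {("D", "A"): 1, ("F", "B"): 1, ("A", "C"): 0, ("B", "D"): 0}
print(rotate_down(dfg, {"A", "B"}))
```

Returning a new dictionary rather than mutating the input keeps the original schedule available for comparison after the rotation.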

When nodes A and B are pushed up we look for a new timestep to place them into. During normal rotation scheduling nodes may not change partitions. Therefore node A is pushed up to timestep 6. Placing node A in timestep 6 does not violate the dependency between node A and any of its predecessors. Notice that node A cannot be placed into timestep 3, even though timestep 3 is unused

Figure 6: A) A partitioned DFG B) A possible schedule C) A single down rotation without changing partitions D) A single down rotation using variable partition rotation E) A second down rotation with new hardware added

in partition 1, since node A must begin at least 1 timestep after node D is finished executing. Similarly node B is placed into timestep 7. The new schedule in which nodes A and B have been rotated is shown in Figure 6 (C). The copies of nodes A and B above the thick black line that is between timesteps 1 and 2 indicate that nodes A and B are now part of the prologue of the system, in addition to being part of the repeating, static schedule that is shown for timesteps 2 to 7.

Delayed reallocation uses a variation of rotation scheduling in which nodes may be transferred between partitions to reschedule operations on the new hardware. This process is referred to as variable partition rotation. During variable partition rotation nodes may be transferred between partitions as long as changing partitions does not result in a later schedule time. As an example consider Figure 6 (D), in which the same rotation as in (C) is shown. Node A, which was rotated to timestep 6 in (C), may now be rotated to timestep 3 if it is placed in partition 2. Similarly, node B, which was rotated to timestep 7 in (C), may now be placed into timestep 6 if placed into partition 1. Notice that after variable partition rotation the DFG may be scheduled in 5 timesteps, while it required 6 timesteps after normal rotation scheduling.

During each down rotation of the variable partition rotation routine the algorithm in Figure 7 is applied. Let us consider the down rotation algorithm in more detail. During each rotation we rotate down all nodes that are in the first timestep. These nodes are placed in list $L$. For each node in $L$ we calculate the best timestep that it can be pushed up to by using the subroutine Best_Timeslot. We then choose the nodes that may be scheduled the earliest and place them in list $L'$. If there is more than one node in $L'$, we choose the node from $L'$ which would be scheduled the latest if it were not placed in this timestep. The subroutine Second_Best_Timeslot is used to calculate the second best timestep that a node could be scheduled in. Once we have chosen the best node from $L'$ we decide which partition to place it into. In most cases this is trivial since it may only be placed into one partition and still be scheduled as soon as possible. However, in situations where the best node may be scheduled at the same time in two or more partitions, the subroutine Percent_Used is used to calculate the percent of timesteps currently used for each of these partitions. The partition with the least percent of timesteps used is then chosen. Once we have chosen what node to rotate, what timestep to place it into, and what partition to place it into, we rotate the node and remove it from list $L$. This process continues until rotating down all nodes of the DFG does not result in a change in the schedule length. Hence the entire graph may need to be rotated down several times before the minimal schedule is obtained.

As an example consider Figure 6(E), in which a new, faster hardware unit has been added to the system. The hardware is 50% faster than the software. By 50% faster, it is meant that each hardware

ALGORITHM VARIABLE_PARTITION_DOWN_ROTATE(DFG)

Input : A DFG

Output : A DFG which has been rotated down once.

begin
    L ← Nodes_in_First_Timestep(DFG)
    while (L ≠ NULL)
        /* Choose Nodes That May Be Pushed Up To the Highest Timeslot */
        for Each Node in L
            T_Node ← Best_Timeslot(Node);
        end
        L' ← Nodes_With_Best_Timeslot(L);
        /* Calculate Where Node Would Be Placed If Not In this Timeslot */
        for Each Node in L'
            T_Node ← Second_Best_Timeslot(Node);
        end
        Node_Chosen ← Node_With_Worst_Timeslot(L');
        Timeslot_Into ← Best_Timeslot(Node_Chosen);
        /* Decide What Partition to Place the Node In */
        for All Partitions For Which Node_Chosen May Be Placed into Timeslot_Into
            Used_Partition ← Percent_Used(Partition);
        end
        Partition_Chosen ← Least_Used_Partition();
        Rotate_It(Node_Chosen, Timeslot_Into, Partition_Chosen);
        L ← L − Node_Chosen
    end
end

Figure 7: The variable partition down rotation algorithm.

functional unit executes three operations in the time that each software functional unit executes two operations. We now use variable partition rotation to rotate down nodes C and D from Figure 6(D).

Node C is scheduled first since it may be pushed up to (software) timestep 4 while node D may not be scheduled until timestep 8. Node C is placed into timestep 4 in partition 2. Next node D is rotated.

Since both nodes B and G are predecessors of D, it may not be scheduled in timestep 7 in either partition 1 or 2. Hence we must decide which partition to place it into. We may place it in partition 1 or 2, in which case it will finish at the end of timestep 8, or in the hardware in which case it will finish at the same time. The hardware is chosen in this case since it is the least used of the three partitions.
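The node and partition choices made in this example follow the selection steps of Figure 7, which can be sketched as below. This is a simplified illustration, not the authors' implementation: the feasible (timestep, partition) placements of each node are supplied as fixed inputs rather than recomputed after every placement, and the partition usage figures are made-up values.

```python
import math


def best_timeslot(options):
    # Earliest timestep among a node's feasible (timestep, partition) pairs.
    return min(t for t, _ in options)


def second_best_timeslot(options):
    # Earliest timestep strictly later than the best one, if any exists.
    best = best_timeslot(options)
    later = [t for t, _ in options if t > best]
    return min(later) if later else math.inf


def down_rotate_order(candidates, percent_used):
    """Return [(node, timestep, partition), ...] in the order nodes are placed."""
    remaining = dict(candidates)
    placed = []
    while remaining:
        # Nodes that can be pushed up to the earliest timestep (the list L').
        earliest = min(best_timeslot(o) for o in remaining.values())
        tied = [n for n, o in remaining.items() if best_timeslot(o) == earliest]
        # Prefer the node that would otherwise end up latest.
        node = max(tied, key=lambda n: second_best_timeslot(remaining[n]))
        # Among partitions offering that timestep, take the least used one.
        partitions = [p for t, p in remaining[node] if t == earliest]
        partition = min(partitions, key=lambda p: percent_used[p])
        placed.append((node, earliest, partition))
        del remaining[node]
    return placed


# Placements as described in the text; the usage percentages are hypothetical.
candidates = {
    "C": [(4, "partition 2")],
    "D": [(8, "partition 1"), (8, "partition 2"), (8, "hardware")],
}
percent_used = {"partition 1": 0.8, "partition 2": 0.8, "hardware": 0.2}
print(down_rotate_order(candidates, percent_used))
```

With these inputs node C is placed first in partition 2 at timestep 4, and node D then goes to the hardware, matching the outcome described above.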

4.2.2 Immediate Reallocation

Immediate reallocation consists of three steps: moving operations to the new hardware, obtaining a legal schedule, and variable partition rotation. The first step, moving operations to the new hardware, takes place as soon as a new unit of hardware is added to the system. During this step groups of operations

Figure 8: A) Key (software: 1 multiplier, 2 adders; hardware: 1 adder, 33% faster). B) The scheduled DFG. C) The scheduled DFG with four nodes selected to move to the new adder. D) The scheduled DFG after four nodes have been moved to the new adder. E) The final DFG after pushing down nodes 5 and 6 and variable partition rotation.

are transferred to the new hardware. As many groups as possible are transferred, with the criteria for deciding what groups to be transferred explained below. Since transferring groups of nodes may result in new communication delays that may lead to an illegal graph, some nodes may need to be "pushed down" in order to obtain a legal schedule. This is done in the second step of immediate reallocation.

Finally, since the software that the operations were transferred from will now have unused hardware, variable partition rotation is used to find a shorter schedule in the third step of immediate reallocation.

Since the new hardware unit is faster than the software that is already in place, it may be beneficial to transfer groups of operations to the new hardware. As an example, consider Figure 8 (B) which shows an example DFG that has been scheduled on a standard processor which has 2 adders and 1 multiplier. It has been decided to add an adder which is 33% faster than the standard processor to the system.

Figure 8 (C) shows the same figure but with the extra adder inserted in the system. In (C) a group of four nodes has been selected to be transferred to the new adder. By transferring four nodes at a time we may "save" an operation since four operations may now be fit into three (software) timesteps.

Figure 8 (D) shows the DFG after the four nodes have been put into three timesteps and the connections between nodes in the hardware and the standardized processor reestablished.

Notice that node 5 cannot be scheduled in timestep 2 as it was in (B) since we assume data transfers between different partitions take 1 timestep. Therefore, in (D), nodes 5 and 6 are not scheduled. This is denoted with a dashed line for nodes 5 and 6 and their edges. They will be rescheduled during the second step of immediate reallocation using a simple scheduling algorithm that will "push them down" to the next available timestep so that a legal schedule may be obtained. In this case nodes 5 and 6 may be pushed down to timesteps 3 and 4. Note that when we are obtaining a legal schedule we only push down those portions of the system that are not legal. The whole system is not rescheduled.

While we have managed to save an addition by using the faster hardware, we have been forced to delay the execution of nodes 5 and 6 by one time unit. Hence, there is a tradeoff when moving nodes to the new hardware. We will decrease the time it takes for some operations, but others may increase.

When transferring groups of operations we look for those whose successors may be scheduled earlier if the proposed transfer takes place.

After obtaining a legal schedule, variable partition rotation, as explained in Section 4.2.1, is used to arrive at the final schedule. Figure 8 (E) shows the final DFG after variable partition rotation. The final DFG takes four timesteps to execute while the original DFG took five timesteps.

Let us examine the steps of immediate reallocation in more detail. The first step of immediate reallocation involves transferring nodes to the new hardware. In order to do this we first decide what nodes to transfer. If the hardware is P% faster than the software, at least $\frac{100}{P} + 1$ operations must be transferred to the hardware in order to save an operation. For example, if the hardware is 33.3% faster than the software, then $\frac{100}{33.3} + 1 = 4$ operations must be moved to the hardware in order to save a timestep.
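The arithmetic behind this threshold can be checked with a small sketch. It assumes that "P% faster" means the hardware completes $100 + P$ operations in the time the software completes 100, and it rounds $\frac{100}{P}$ up when it is not a whole number; both the rounding and the function names are assumptions of the sketch, not statements from the paper.

```python
import math
from fractions import Fraction


def min_ops_to_save_one(p):
    # Smallest group size that saves a timestep: 100/P + 1, rounded up (assumed).
    p = Fraction(p)
    return math.ceil(Fraction(100) / p) + 1


def timesteps_saved(n, p):
    # n operations take n * 100 / (100 + P) software timesteps on the hardware.
    p = Fraction(p)
    hw_time = math.ceil(Fraction(100 * n) / (100 + p))
    return n - hw_time


third = Fraction(100, 3)  # the paper's 33.3%, taken as exactly one third
print(min_ops_to_save_one(third), timesteps_saved(4, third))  # 4 1
print(min_ops_to_save_one(50), timesteps_saved(3, 50))        # 3 1
```

The percentage is passed as an exact fraction because the rounded value 33.3 would push $\frac{100}{P}$ slightly above 3 and change the rounded result.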

In order to decide what operations to transfer we calculate the time differential, $TD$, that results from transferring different groups of operations. The group of nodes with the greatest time differential is transferred to the new hardware. Three factors are considered when calculating the time differential: the number of operations transferred to the new hardware, $O_t$, the number of operations saved, $O_s$, and the saved successor time of each path affected by the transfer, $st_i$. The time differential is defined as

$$TD = \sum_{i=1}^{successors} st_i + \frac{O_s}{\alpha O_t}.$$

The term $\frac{O_s}{\alpha O_t}$ is a second-order term used to resolve ties when the saved successor times are equal. $O_t$ is simply the number of operations transferred to the new hardware. $O_s$ is the number of operations transferred to the new hardware for which $tsp(o_i) = \max_{k=1}^{O_t}(tsp(o_k))$. In other words $O_s$ is the number of operations that are scheduled in the greatest timestep of all operations being transferred. The constant in the denominator (written $\alpha$ here) is set large enough (approximately 10) so that the term $\frac{O_s}{\alpha O_t} < 1$. This ensures that this term will only make a difference when the sums of the saved successor times are equal. Intuitively, this term represents the percent of operations transferred to the new hardware that have finished executing at an earlier timestep.

The saved successor time represents the number of timesteps that the successors of nodes being transferred could be moved up if unlimited hardware were available. We use the notation $tse(o_i)$ to denote the timestep which operation $i$ (a successor of one of the nodes being transferred to the new hardware) would be scheduled at if unlimited hardware were available, given that the nodes transferred to the new hardware have been scheduled. Hence $st_i = tse(o_i)_{old\ schedule} - tse(o_i)_{new\ schedule}$.

As an example consider Figure 9. As noted in (A) the new hardware, an adder, is 50% faster than the standardized processor. Hence we must transfer at least $\frac{100}{50} + 1 = 3$ operations to the adder in order to save a timestep. First let us consider transferring nodes 2, 3 and 5 to the new hardware ($O_t = 3$). If this were done, node 2 would have to be delayed 1 time unit since it would be in a different partition than node 1. Furthermore, node 4 would be delayed an additional time unit because of the data transfer between nodes 2 and 4. Hence $st_4 = -2$. Similarly, node 8 would need to be delayed two timesteps since the data transfer between it and node 3 would take an additional timestep. On the other hand we would save one time unit by scheduling nodes 2, 3, and 5 on the faster hardware ($O_s = 1$). However, we would also lose one unit of time on the transfer of data from node 5 to node 6. Hence $st_6 = -1$ and $st_8 = -2$, and if the tie-breaking constant is set to 10, $TD = \frac{1}{10 \cdot 3} + (-2 + -2 + -1) = -4\frac{29}{30}$. Overall, we save one operation by moving three nodes but we increase the length of three paths by doing so. Therefore, nodes 2, 3 and 5 are likely not a good choice. In general, nodes whose input will not be available at the timestep they are currently scheduled in if they are moved to a different partition, such as node 2 in this case, are not a good choice. These nodes cannot be scheduled on the faster hardware until one timestep after they were previously scheduled due to the delay associated with transferring data between partitions.
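The calculation in this example can be reproduced with a short sketch. The parameter name `alpha` for the tie-breaking constant and the function names are assumptions of the sketch; the inputs are the values quoted in the example.

```python
from fractions import Fraction


def saved_successor_time(tse_old, tse_new):
    # Positive when a successor could be scheduled earlier after the transfer.
    return tse_old - tse_new


def time_differential(saved_times, ops_saved, ops_transferred, alpha=10):
    # Sum of saved successor times plus the small tie-breaking term.
    tie_break = Fraction(ops_saved, alpha * ops_transferred)
    return sum(saved_times) + tie_break


# Transferring nodes 2, 3 and 5: st_4 = -2, st_8 = -2, st_6 = -1, O_s = 1, O_t = 3.
td = time_differential([-2, -2, -1], ops_saved=1, ops_transferred=3)
print(td)  # -149/30
```

The result $-\frac{149}{30}$ equals $-4\frac{29}{30}$, so with the constant set to 10 the tie-breaking term shifts the total by only $\frac{1}{30}$ and cannot outweigh a one-timestep difference in the saved successor times.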

As a second example let us consider transferring nodes 1, 2 and 3 to the hardware. In this case all

Figure 9: A) Key (software: 1 multiplier, 2 adders; hardware: 1 adder, 50% faster). B) The scheduled DFG. C) The scheduled DFG with three nodes moved to the new hardware. D) The scheduled DFG with six nodes moved to the new hardware.
