CHESS: A Tool for CDFG Extraction and High-Level Synthesis of VLSI Systems


University of South Florida

Scholar Commons

Graduate Theses and Dissertations Graduate School

7-8-2003

CHESS: A Tool for CDFG Extraction and High-Level Synthesis of VLSI Systems

Ravi K. Namballa

University of South Florida


Scholar Commons Citation

Namballa, Ravi K., "CHESS: A Tool for CDFG Extraction and High-Level Synthesis of VLSI Systems" (2003). Graduate Theses and Dissertations.

https://scholarcommons.usf.edu/etd/1439


CHESS: A TOOL FOR CDFG EXTRACTION AND HIGH-LEVEL SYNTHESIS OF VLSI SYSTEMS

by

RAVI K NAMBALLA

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science

Department of Computer Science and Engineering College of Engineering

University of South Florida

Major Professor: N. Ranganathan, Ph.D.

Murali Varanasi, Ph.D.

Abdel Ejnoui, Ph.D.

Date of Approval:

July 8, 2003

Keywords: High-Level Synthesis, Resource Optimization, Low Power Binding, CDFG Extraction, Tabu Search, Game Theory


DEDICATION

To My Mother


ACKNOWLEDGEMENTS

I would like to express gratitude to my major professor, Dr. N. Ranganathan, for his encouragement, guidance, support and friendship throughout my Master’s program. Without his patience and his valuable suggestions, this thesis would not have been completed. I would also like to thank Dr. Varanasi and Dr. Abdel for guiding me as my committee members.

I would also like to thank Ashok Murugavel for his ideas and his help throughout my thesis work. I wish to thank Sarju Mohanty for providing me with his collection of related works in VLSI.

I would also like to thank all members of VCAPP group for their help and support.

I really appreciate the invaluable support that I received from my brother, without which this work would not have been possible. Also, I would like to acknowledge the support of my roommates and friends.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1 INTRODUCTION

1.1 System Description and Intermediate Representation

1.2 Scheduling

1.2.1 Time-Constrained Scheduling

1.2.2 Resource Constrained Scheduling

1.2.3 Other Scheduling Approaches

1.3 Allocation and Binding

1.4 Motivation for Our Thesis

1.5 Thesis Outline

CHAPTER 2 RELATED WORK

2.1 Compiler-Level Transformations in High-Level Synthesis

2.2 DFG-Based Works

2.2.1 Scheduling

2.2.2 Allocation and Binding

2.3 CDFG-Based Works

2.3.1 Scheduling

2.3.2 Allocation/Binding

2.3.3 Our Work

CHAPTER 3 CDFG EXTRACTION FROM VHDL

3.1 Introduction

3.2 Preliminaries

3.2.1 Control and Data Flow Graph

3.2.2 Basic VHDL Constructs

3.3 Implementation Details

3.3.1 Methodology

3.3.2 Algorithmic Description

3.3.3 Transforming VHDL Constructs

3.3.3.1 Operational Statements

3.3.3.2 Assignment Statements

3.3.3.3 Conditional Statements

3.3.3.4 Loop Statements

3.3.4 Output Formats

3.3.4.1 Adjacency-List Representation

3.3.4.2 Set Representation

3.3.4.3 Visual Representation

3.4 Unhandled Features

3.5 Summary

CHAPTER 4 SCHEDULING THE CDFG

4.1 Introduction

4.2 Mutual Exclusion Among Operations

4.3 Penalty Weights

4.4 Scheduling Algorithm

CHAPTER 5 POWER-OPTIMIZED BINDING

5.1 Power Optimization During Binding

5.2 Basic Concepts

5.2.1 Game Theory

5.2.2 Auction Theory

5.3 Problem Formulation

5.3.1 Algorithmic Description

5.4 Summary

CHAPTER 6 EXPERIMENTAL RESULTS

6.1 CDFG Extraction From Behavioral VHDL

6.2 Scheduling

6.2.1 The Differential Equation Benchmark

6.2.2 Elliptic Filter

6.3 Binding

CHAPTER 7 CONCLUSIONS

REFERENCES


LIST OF TABLES

Table 6.1. Experimental Results for CDFG Extraction From Behavioral VHDL Specification

Table 6.2. Comparison of Schedules for the Differential Equation Benchmark Circuit

Table 6.3. Comparison of Schedules for the Elliptic Filter Benchmark Circuit

Table 6.4. Power and Delay Values of the Library Cells

Table 6.5. Comparison of Binding Results


LIST OF FIGURES

Figure 1.1. VLSI Design Flow

Figure 2.1. Taxonomy of Related Works in High-Level Synthesis

Figure 3.1. CDFG Representation

Figure 3.2. Steps Involved in Extraction of the CDFG From VHDL Code

Figure 3.3. Extraction of CDFG From a Sample VHDL Code

Figure 3.4. Algorithm for CDFG Extraction

Figure 3.5. CDFG: AND Operation

Figure 3.6. CDFG: Variable Assignment

Figure 3.7. CDFG: Signal Assignment

Figure 3.8. CDFG: If-Then-Else Statement

Figure 3.9. CDFG: Loop Statement

Figure 4.1. Mutually Exclusive Nodes

Figure 4.2. Penalty Weights

Figure 4.3. Life-Time and Number of Buses From CDFG

Figure 4.4. Scheduling Algorithm

Figure 5.1. Scheduled CDFG and its Binding Matrix

Figure 5.2. Algorithm for Finding the Nash Equilibrium

Figure 5.3. Algorithm for Finding the Cost Matrix

Figure 5.4. Binding Algorithm

Figure 6.1. CDFG Extracted for the Differential Equation Benchmark Circuit

Figure 6.2. CDFG Extracted for the Elliptic Filter Benchmark Circuit

Figure 6.3. CDFG Extracted for the Greatest Common Divisor Benchmark Circuit

Figure 6.4. CDFG Extracted for the Fast Fourier Transform Benchmark Circuit

Figure 6.5. ASAP Schedule for Differential Equation Benchmark

Figure 6.6. ALAP Schedule for Differential Equation Benchmark

Figure 6.7. Optimal Schedule for the Differential Equation Benchmark in 4 Control Steps

Figure 6.8. Schedule for the Differential Equation Benchmark in 6 Control Steps

Figure 6.9. Scheduled CDFG for the Elliptic Filter Benchmark Circuit

Figure 6.10. Improvement in Memory Requirement


CHESS: A TOOL FOR CDFG EXTRACTION AND HIGH-LEVEL SYNTHESIS OF VLSI SYSTEMS

RAVI K NAMBALLA

ABSTRACT

In this thesis, a new tool, named CHESS, is designed and developed for control and data-flow graph (CDFG) extraction and the high-level synthesis of VLSI systems. The tool consists of three individual modules for (i) CDFG extraction, (ii) scheduling and allocation of the CDFG, and (iii) binding, which are integrated to form a comprehensive high-level synthesis system. The first module, for CDFG extraction, includes a new algorithm in which certain compiler-level transformations are applied first, followed by a series of behavior-preserving transformations on the given VHDL description. Experimental results indicate that the proposed conversion tool is quite accurate and fast. The CDFG is fed to the second module, which schedules it for resource optimization under a given set of time constraints. The scheduling algorithm improves on the execution time of the Tabu Search based algorithm described in [6]. The improvement is achieved by moving the identification of mutually exclusive operations, which is normally done during scheduling, to the CDFG extraction phase. The last module of the proposed tool implements a new binding algorithm based on a game-theoretic approach. The problem of binding is formulated as a non-cooperative finite game, for which a Nash-Equilibrium function is applied to achieve a power-optimized binding solution. Experimental results for several high-level synthesis benchmarks are presented, which establish the efficacy of the proposed synthesis tool.


CHAPTER 1 INTRODUCTION

VLSI technology has advanced to a level where it would be extremely difficult to design digital systems starting at the transistor level or at the physical level. The increasing complexity of designs and the ever-growing competitiveness of the design market have made it necessary to take the design process to much higher levels of abstraction, where the design tradeoffs of time and efficiency can be carefully evaluated by the design engineer. This led to the automation of the design process based on a top-down methodology, starting from the conceptualization of the design and ending with its realization on silicon. VLSI technology has now evolved to a point where the high-level synthesis of VLSI systems has become more cost effective and less time consuming than the traditional method of designing everything by hand.

A typical VLSI design flow is shown in Figure 1.1. The first level of the design flow is the system-level specification, which is the most abstract form of representation of the design and mostly gives its description in plain English. The next level is the behavioral description, which gives a functional description of the design while avoiding its structural details.

The RTL description, on the other hand, is composed of instances of modules such as adders, multipliers, registers, etc. that provide the structural details of the design. The process of translating a behavioral description into a structural description is termed High-Level Synthesis.

The process of synthesizing an RTL structure from the functional description during high-level synthesis involves three phases:

Allocation: determining the number of instances of each resource needed.

Binding: assignment of resources to computational operations.

Scheduling: timing of computational operations.

Figure 1.1. VLSI Design Flow (levels: System Specification, Behavioral Description, RTL Description, Gate Level Description, Physical Layout; steps: System-Level Design, High-Level Synthesis with Transformation, Compilation, Scheduling and Allocation/Binding, Logic Synthesis, Layout Synthesis)

High-level synthesis starts at the system level and proceeds downwards to the RTL level, passing through each of the above phases and each time adding the information needed at the next level.

Behavioral synthesis requires transformation of the VHDL code into an internal representation which extracts registers, combinational logic equations and macros like ’+’, ’-’, etc. for the scheduling, allocation and binding processes. Most systems use a representation like the control flow graph and/or the data flow graph or the combination of the two like the CDFG as their intermediate format.

1.1 System Description and Intermediate Representation

The system to be designed is described at the most abstract level in plain English, i.e., in a form most easily understood by the user. The behavior of the system is captured at the algorithmic level through a programming language such as Ada or Pascal, or a hardware description language such as VHDL, HardwareC [73], MIMOLA [114] and SILAGE [40].

The behavior of the system, specified in a high-level language, is compiled into an internal representation that is suitable for the rest of the synthesis process. The transformation of the behavioral specification into its unique graphical representation is analogous to the non-optimizing compilation of a programming language.

The data representation adopted by several behavioral synthesis systems may vary slightly in style and structure, but, in general, the control and data dependencies are encapsulated in one or two graphs. The data flow graph is a directed graph which depicts the flow of data, while the control flow graph is a directed graph which indicates the sequence of operations.

1.2 Scheduling

Scheduling is the step in high-level synthesis in which the operations are grouped into control steps based on their types and dependencies, in such a way that the operations in the same control step can be executed simultaneously. A wide variety of scheduling approaches exist, directed at either reducing the total execution time or minimizing the number of resources needed for the design. Broadly, these approaches can be classified into four categories: basic scheduling, time-constrained scheduling, resource-constrained scheduling and miscellaneous scheduling.

The control and data flow graphs depict the inherent parallelism in a design, based on which, each node could be assigned a range of control steps. Most of the scheduling algorithms require the earliest and the latest bounds that define the range of control steps for each node in the CDFG.

Two simple schemes that are widely used to determine these bounds are called the As Soon As Possible (ASAP) and the As Late As Possible (ALAP) algorithms.

The ASAP algorithm begins by scheduling the initial nodes, i.e., nodes without any predecessors, in the first time step, and assigns the time steps in increasing order as it proceeds downwards.

The algorithm is guided by the simple principle that a particular node can be executed only if all of its predecessors have been executed. Ignoring resource constraints, this algorithm gives the least number of control steps required for the design, and hence could be used for near-optimal microcode compilation [84].

The ALAP algorithm is analogous to the ASAP scheme, except that the operations here are intentionally postponed to the latest possible control step. The algorithm begins at the bottom of the CDFG, i.e., with nodes that have no successors, and proceeds upwards to nodes that have no predecessors. This algorithm gives the slowest possible schedule for a given design.
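As a minimal sketch of the two schemes, assuming unit-delay operations and a data flow graph stored as a hypothetical predecessor map (`preds`; the node names are illustrative and not taken from the thesis benchmarks), the ASAP and ALAP bounds and the resulting mobility of each node could be computed as follows:

```python
from collections import defaultdict

def asap(preds):
    """Earliest control step per node; nodes without predecessors go in step 1."""
    step = {}
    def visit(n):
        if n not in step:
            step[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return step[n]
    for n in preds:
        visit(n)
    return step

def alap(preds, latency):
    """Latest control step per node, given the total number of steps allowed."""
    succs = defaultdict(list)
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    step = {}
    def visit(n):
        if n not in step:
            # Nodes without successors are pushed to the last allowed step.
            step[n] = min((visit(s) for s in succs[n]), default=latency + 1) - 1
        return step[n]
    for n in preds:
        visit(n)
    return step

# Hypothetical DFG: each node maps to the list of its predecessors.
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
early = asap(preds)
late = alap(preds, latency=max(early.values()))
mobility = {n: late[n] - early[n] for n in preds}
print(early, late, mobility)
```

For this graph only `m3` has a non-zero mobility, since it can execute in step 1 or step 2 without delaying `a2`.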

1.2.1 Time-Constrained Scheduling

The time-constrained scheduling approach is often adopted for designs targeted towards applications in real-time systems, like digital signal processing systems, which are often limited by the response time. Here, the main objective is to realize the design with the minimum possible hardware while meeting the time constraint. Time-constrained scheduling is usually implemented using three different techniques: mathematical programming, constructive heuristics and iterative refinement.

Integer Linear Programming. The ILP method is a mathematical formulation of the scheduling problem, which applies a branch-and-bound search algorithm with backtracking to find the optimal schedule.

[...] whenever [...] (1.1)

The ILP approach begins with finding the earliest ($E_i$) and the latest ($L_i$) time bounds for each operation $i$ using the ASAP and ALAP algorithms, respectively. From these, the mobility range for each operation is calculated as

$$\{\, s_j \mid E_i \le j \le L_i \,\} \tag{1.2}$$

and the scheduling problem is formulated by the equation

$$\text{Minimize } \sum_{k=1}^{m} (C_k \cdot N_k) \quad \text{and} \quad \sum_{j=E_i}^{L_i} x_{ij} = 1, \quad 1 \le i \le n, \quad n = \text{no. of operations} \tag{1.3}$$

where $1 \le k \le m$ operation types are available, $N_k$ is the number of functional units of operation type $k$, and $C_k$ is the cost of each FU. $x_{ij}$ is equal to 1 if operation $i$ is assigned to control step $j$, and 0 otherwise.

Such a formulation could be extended to further include resource and data dependency constraints using the equations

$$\sum_{i \in O_k} x_{ij} \le N_k \quad \text{and} \quad (p \cdot x_{ip}) - (q \cdot x_{i'q}) \le -1 \tag{1.4}$$

where $O_k$ is the set of operations of type $k$, and $p$ and $q$ are the control steps assigned to the operations $i$ and $i'$ respectively, with operation $i$ a predecessor of operation $i'$.

One major drawback of the ILP formulation is that its complexity increases rapidly with the number of control steps. For a single additional control step, n additional x variables have to be considered. The ILP approach is computationally intensive and hence, can be applied only to very small problems.
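To make the formulation concrete, the sketch below enumerates every assignment of operations to control steps within their mobility ranges and keeps the cheapest one that respects the data dependencies. Exhaustive enumeration is used here in place of branch-and-bound purely for brevity, which also makes the growth in complexity easy to see. The function name, operation names, ranges and FU costs are all hypothetical.

```python
from itertools import product

def best_schedule(ops, preds, ranges, cost):
    """ops: {name: type}; ranges: {name: (earliest, latest)}; cost: {type: FU cost}."""
    names = list(ops)
    best = None
    # One candidate per combination of control steps inside the mobility ranges.
    for steps in product(*(range(ranges[n][0], ranges[n][1] + 1) for n in names)):
        sched = dict(zip(names, steps))
        # Data dependency: a predecessor must be placed in an earlier step.
        if any(sched[p] >= sched[n] for n in names for p in preds[n]):
            continue
        # Units needed per type = peak number of such operations in one step.
        usage = {}
        for n, s in sched.items():
            usage[(ops[n], s)] = usage.get((ops[n], s), 0) + 1
        fus = {}
        for (k, _), count in usage.items():
            fus[k] = max(fus.get(k, 0), count)
        total = sum(cost[k] * fus[k] for k in fus)
        if best is None or total < best[0]:
            best = (total, sched, fus)
    return best

ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
ranges = {"m1": (1, 1), "m2": (1, 1), "m3": (1, 2), "a1": (2, 2), "a2": (3, 3)}
cost = {"mul": 4, "add": 1}  # assumed relative costs
total, sched, fus = best_schedule(ops, preds, ranges, cost)
print(total, sched, fus)
```

With these assumed costs, placing `m3` in step 2 saves one multiplier compared with placing it in step 1.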

Another approach for time-constrained scheduling is a heuristic method called force-directed scheduling. This algorithm tries to reduce the total number of functional units used by uniformly distributing the operations of the same type over the available control steps.
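One common way to quantify this uniform distribution is a distribution graph, in which every operation contributes an equal probability to each control step in its ASAP-ALAP range. The sketch below assumes unit-delay operations and uses hypothetical operation names and ranges; it illustrates the idea only and is not the scheduler used in this thesis.

```python
from fractions import Fraction

def distribution_graph(ops, ranges, n_steps):
    """Expected number of operations of each type in each control step."""
    dg = {}
    for name, kind in ops.items():
        lo, hi = ranges[name]
        share = Fraction(1, hi - lo + 1)  # uniform probability over the range
        row = dg.setdefault(kind, [Fraction(0)] * n_steps)
        for s in range(lo, hi + 1):
            row[s - 1] += share
    return dg

ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
ranges = {"m1": (1, 1), "m2": (1, 1), "m3": (1, 2), "a1": (2, 2), "a2": (3, 3)}
dg = distribution_graph(ops, ranges, n_steps=3)
print(dg["mul"], dg["add"])
```

Here the multiplier row is 5/2, 1/2, 0. Fixing `m3` in step 2 would flatten it to 2, 1, 0, which is the kind of move the heuristic favors.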

1.2.2 Resource Constrained Scheduling

Resource constrained scheduling algorithms are used in applications where the design is restricted by the silicon area. The goal of these algorithms is to minimize the number of control steps while satisfying the resource constraints. The schedule is built one operation at a time, so that the resource constraints and data dependencies are not violated. The total number of control steps is minimized in such a way that the number of operations scheduled in any control step does not exceed the number of FUs available.

Two popular approaches for scheduling operations with resource constraints include list-based scheduling and static-list scheduling.

List-based scheduling is based on including resource constraints in the ASAP algorithm. A priority list of ready nodes is maintained, and each such list is associated with a priority function that resolves any resource conflicts. A ready node is a node whose predecessors have already been scheduled. The algorithm proceeds by first scheduling operations with higher priority, while the lower priority operations are deferred to later control steps. At every step, the successors of a scheduled operation are added to the priority list of ready nodes.

The efficiency of such a list scheduling algorithm depends mostly on the priority function employed. A simple priority function assigns a priority that is inversely proportional to the mobility of the operation, thereby ensuring that operations with large mobility are deferred to later control steps, since they can go into a larger number of control steps.

Alternatively, we could assign a priority based on the length of the longest path from the operation node to a node with no immediate successor. One major drawback of the list-based scheduling is the increased time and space complexity because of the several lists that have to be maintained dynamically.
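A minimal sketch of list-based scheduling is given below, assuming unit-delay operations, at least one functional unit per operation type, and a priority of 1/(1 + mobility). The function name, graph, mobilities and resource limits are hypothetical. For brevity the ready list is recomputed at every step rather than maintained incrementally.

```python
def list_schedule(ops, preds, fu_count, priority):
    """Resource-constrained list scheduling with unit-delay operations.

    ops: {name: type}; fu_count: {type: available units};
    priority: {name: value}, where higher values are scheduled first.
    """
    sched, step = {}, 0
    while len(sched) < len(ops):
        step += 1
        # Ready nodes: unscheduled, with every predecessor in an earlier step.
        ready = [n for n in ops if n not in sched
                 and all(sched.get(p, step) < step for p in preds[n])]
        ready.sort(key=lambda n: -priority[n])
        free = dict(fu_count)
        for n in ready:
            if free[ops[n]] > 0:  # otherwise the node is deferred
                free[ops[n]] -= 1
                sched[n] = step
    return sched

ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
mobility = {"m1": 0, "m2": 0, "m3": 1, "a1": 0, "a2": 0}
priority = {n: 1 / (1 + m) for n, m in mobility.items()}
sched = list_schedule(ops, preds, {"mul": 1, "add": 1}, priority)
print(sched)
```

With a single multiplier, `m3` loses to `m1` and `m2` in the first two steps because of its larger mobility, and the schedule stretches to four control steps.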

The static-list scheduling is based on building a single list of operations statically, as opposed to the normal list-based scheduling, where the list grows dynamically. The ASAP and ALAP algorithms are applied initially to find the mobility range for each operation. The operations are sorted in ascending order based on their greatest control step assignment, and then, the operations with the same greatest control step value are sorted in descending order of their least control step value. The operations are then scheduled sequentially in the descending order of their priority. The operations that cannot be scheduled in a control step due to unavailability of resources are deferred to later control steps.
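The two-level sort described above can be sketched as a single sort with a composite key. The ASAP and ALAP values below are hypothetical, and the function name is an assumption made for illustration.

```python
def static_priority_list(early, late):
    """Ascending ALAP step, with ties broken by descending ASAP step."""
    return sorted(early, key=lambda n: (late[n], -early[n]))

early = {"m1": 1, "m2": 1, "m3": 1, "a1": 2, "a2": 3}  # least control steps
late = {"m1": 1, "m2": 1, "m3": 2, "a1": 2, "a2": 3}   # greatest control steps
order = static_priority_list(early, late)
print(order)
```

Because the list is built once, `a1` is ranked ahead of `m3`: both have the same greatest control step, but `a1` has the larger least control step.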

1.2.3 Other Scheduling Approaches

Apart from the previously discussed scheduling algorithms, several other approaches, such as simulated annealing, have been successfully used to solve the scheduling problem.

In the simulated annealing based approach [26], scheduling is treated as a placement problem, where the operations are to be placed in a two-dimensional table of control steps versus available functional units. The algorithm begins with an initial placement of operations, and iteratively modifies the table by displacing an operation. The new schedule is evaluated based on the cost of displacement, and is accepted with a certain probability even when it is not better than the previous one, in order to escape local minima in the solution space. The simulated annealing approach is thus suitable for obtaining globally optimum solutions, but requires long execution times to find them.

Another approach is the Path-based scheduling [13], which is based on minimizing the number of control steps needed to execute the critical path in a CDFG. Initially, all possible paths of the CDFG are extracted and scheduled independently and later these schedules are combined to get the final schedule. The algorithm transforms the problem of introducing minimum control step constraint into a clique-partitioning problem. A clique partitioning solution would indicate the minimum overlapping of intervals in a given path.

1.3 Allocation and Binding

Allocation is the process of determining the number of functional units of each type needed for performing the operations, while binding is the process of assigning each operation to a particular functional unit. Allocation ensures that a sufficient number of resources is available for executing the operations, and binding decides the actual components to be used for each operation. Binding has an impact on the amount of multiplexing and interconnections in the final design.

Allocation and binding could be classified into three categories based on their objective: allocation and binding for functional units, for memory units, and for interconnections. Allocation and binding for functional units consists of grouping operations in such a way that each group consists of mutually exclusive operations while the total number of groups is minimized. In memory unit allocation/binding, values that are generated in one control step and used in another are assigned to memory units for storage. Here, the objective is to minimize the number of memory units and also to simplify the communication paths. Interconnection allocation and binding includes assignment of buses, multiplexer and de-multiplexer connections to perform the data transfer in each time step.
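As an illustration of functional unit binding, the sketch below groups operations of the same type onto a shared unit whenever they occupy different control steps. It assumes unit-delay operations and a schedule that is already fixed, and it ignores mutual exclusion arising from conditional branches. The function name, unit labels and schedule are hypothetical, and this greedy grouping is not the game-theoretic binding developed in this thesis.

```python
def bind_functional_units(ops, sched):
    """Greedy binding: operations of one type share a unit if their steps differ."""
    units = {}    # (type, unit index) -> set of control steps already occupied
    binding = {}
    for name in sorted(ops, key=lambda n: sched[n]):
        kind, idx = ops[name], 0
        # Find the first unit of this type that is idle in the operation's step.
        while sched[name] in units.setdefault((kind, idx), set()):
            idx += 1
        units[(kind, idx)].add(sched[name])
        binding[name] = f"{kind}{idx}"
    return binding

ops = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
sched = {"m1": 1, "m2": 1, "m3": 2, "a1": 2, "a2": 3}
binding = bind_functional_units(ops, sched)
print(binding)
```

Under the unit-delay assumption, the number of units used per type equals the peak number of operations of that type in any single control step.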

1.4 Motivation for Our Thesis

The automation of the design process has been made necessary by the increasing complexity of designs and the shrinking time-to-market requirements of the design market. Shifting the design process to higher levels of abstraction has been the motivating factor for several research works on high-level synthesis. Despite the availability of several tools for synthesizing behavioral descriptions of designs, their application in research work is quite limited, since most of them are commercially oriented tools. Moreover, most of the previous works on high-level synthesis target data-dominated designs and are not adequate for handling control-dominated designs. Control-flow intensive behaviors with inherent loops and conditionals are quite possible in network-centered systems. This has motivated us to develop a comprehensive high-level synthesis system that could be used for both data-flow and control-flow intensive designs. The system, generating outputs at different stages of the synthesis process, aids researchers by providing them with the flexibility of several entry and exit points in the system.

The high-level synthesis process requires the compilation of the behavioral description of the design into a graphical representation, capturing the control and data dependencies. The derivation of such a Control and Data Flow Graph (CDFG) has been done mostly manually, which makes this process time-consuming and error-prone at least in the earlier stages of synthesis. Our synthesis system, therefore, includes a tool for automatic conversion of a given behavioral VHDL description into its corresponding CDFG. Such a CDFG is generated in several formats to accommodate different implementation approaches.

Traditionally, the design automation tools were developed with the objective of reducing area and improving the speed of designs. However, with the introduction of portable wireless devices and other micro equipment like laptop computers, power dissipation of the circuits has slowly evolved as a major concern of the design process. Such a trend has placed the problem of power optimization in the early design cycle. We have addressed this problem of power optimization in the binding phase of our synthesis system.

1.5 Thesis Outline

The rest of the thesis is organized as follows: We enumerate some of the previous works related to this field in chapter 2. An automatic conversion tool that is used to extract a CDFG from the given behavioral description is described in chapter 3. Chapter 4 gives a brief overview of the scheduling approach used in our synthesis system. Chapter 5 describes our game-theory based binding algorithm that incorporates power optimization. Experimental results obtained upon some of the standard high-level synthesis benchmark circuits are presented in chapter 6. Finally, we give the concluding remarks in chapter 7.


CHAPTER 2 RELATED WORK

The advent of design automation has resulted in a significant amount of work at many levels of design abstraction. A number of techniques have been proposed for high-level synthesis, some of which are briefly discussed here. A taxonomy of related works in high-level synthesis (HLS) is provided in Figure 2.1. The related works are classified on the basis of the intermediate representation they use (DFG or CDFG), and the tasks that they target in HLS (scheduling, allocation or binding). We have also enumerated works on transformations of initial behavioral descriptions.

We now present a summary of these works according to our classification.

2.1 Compiler-Level Transformations in High-Level Synthesis

In this section, we cite various works on compiler-level transformations of original behavioral descriptions that aid in the next steps of HLS.

Aho et al. [7] proposed the application of several compiler optimization techniques, such as constant folding and redundant operator elimination, on the flow-graph representation. Arrayed variables were another source of compiler-level optimizations for HLS considered in [36] and [79].

Since arrays in the behavioral descriptions get mapped to memories, it was proposed in [55] that reducing the number of array accesses decreases the overhead resulting from accessing memory structures.

Lis and Gajski [67] identified some of the advantages of capturing design requirements in a behavioral form. These include:

Technology dependent details of implementation are not embedded in the design specification.

The behavioral description could be applied to a simulator to verify the correctness of a new design, or to validate an existing design specification.

As the implementation technologies change, the available behavioral description could be used to redesign a circuit to make it compatible with the new technology.

Behavioral synthesis increases productivity, minimizes errors, decreases design time without any technology specific expertise from the designer.

Figure 2.1. Taxonomy of Related Works in High-Level Synthesis (categories: Compiler-level Transformations; DFG-based Scheduling and Allocation/Binding; CDFG-based Scheduling and Allocation/Binding; this work is listed under Scheduling/Allocation/Binding)

With the increasing use of VHDL for design description, some approaches have been proposed that are specific to transformations on VHDL. Bhasker and Lee [10] proposed approaches to identify specific syntactic constructs and replace them with attributes on signals and nets to indicate their functions. In order to reduce the syntactic variation of descriptions with the same semantics, Chaiyakul et al. [14] proposed a transformation technique that uses assignment decision diagrams to minimize syntactic variance in the given description.

The COMET (Cluster-Oriented and Minimum Execution Time) design system proposed by Chang, Rose and Walker [20] synthesizes synchronous pipeline ASICs. It uses VHDL for describing the behavioral specifications. Such a description is restricted to statements with arithmetic and logical operations and control constructs like if, case and loops. They designed a subsystem, named VCOMP, for converting such a behavioral description into a DFG representation. The DFG is then subject to optimizing transformations.

Another approach for flow-graph transformations is tree height reduction [39], which tries to improve the parallelism of the design. A similar method described in [77] uses the commutativity and distributivity properties of the language operators to decrease the height of a long expression chain, exposing the inherent parallelism in a complicated data-flow graph. Other commonly used transformations include pipelining [81], loop folding [33], software pipelining [35] and retiming [70]. Some pattern-matching transformations were applied by Rosenstiel in [90], which are based on RT semantics of the hardware components corresponding to flow-graph operators. Walker et al. [107] applied system level transformations to divide parts of the flow graph into separate processes that run concurrently or in a pipelined fashion. Mekenkamp et al. described a system called TRADES (Transformational DEsign System) in [72], which uses a syntax based translation to transform a subset of VHDL constructs into a CDFG on a per statement basis. Due to such a syntax based approach, the VHDL event mechanism appears in the CDFG without imposing any guidelines on the synthesis process.
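The tree height reduction mentioned above can be illustrated with a small sketch. It assumes an associative operator (addition here) and represents expressions as nested tuples; the helper names are hypothetical and the code does not reproduce the methods of [39] or [77].

```python
def chain(operands, op="+"):
    """Left-deep expression, as a parser would build a + b + c + d."""
    expr = operands[0]
    for x in operands[1:]:
        expr = (op, expr, x)
    return expr

def balanced(operands, op="+"):
    """Rebalanced expression; valid only when op is associative."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return (op, balanced(operands[:mid], op), balanced(operands[mid:], op))

def height(expr):
    """Number of operator levels on the longest path."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(height(expr[1]), height(expr[2]))

terms = ["a", "b", "c", "d"]
print(height(chain(terms)), height(balanced(terms)))  # prints "3 2"
```

The rebalanced form (a + b) + (c + d) needs two operator levels instead of three, so the two inner additions can share a control step when two adders are available.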

Nijhar and Brown [78] identify significant differences between the optimization of VHDL code and that of a conventional, sequential programming language, which are often assumed to be alike. According to them, transformations applied on sequential programming languages are limited by a fixed target architecture, i.e., the architecture on which the program is to be run. VHDL optimization, however, has an extra degree of freedom associated with it in that it can manipulate the executing hardware itself.

Potkonjak et al. [70] proposed methods for transforming a behavioral description so that synthesis of the new description requires less area overhead. They proposed a two-stage objective function for estimating the area and testability as well as for evaluating the effects of a transformation. From there, a randomized branch-and-bound steepest descent algorithm was employed to search for the best sequence of transformations.

In [91], Rosien et al. present a method to automatically generate a CDFG from C/C++ source code. Such a CDFG is used to automate the programming of a Field Programmable Function Array (FPFA), which is a flexible and energy efficient reconfigurable device. Their CDFG is represented using the hypergraph model, in which the operations are represented by edges (hyperedges) and the inputs and outputs are represented by the nodes which connect the edges. With such a representation, an operation can have any number of distinguishable inputs/outputs. Also, a hypergraph itself can be used as a definition of a new hyperedge, and a whole hierarchical graph can be created this way. The authors have divided the process of generating such a CDFG from C/C++ code into several steps. First, a parse tree is generated from the code, from which the language constructs are converted into a list of hypergraph templates. A complete CDFG is built from these templates, which is then subjected to a series of behavior preserving transformations. Finally, a clean CDFG is obtained in which the control lines and the state space are trimmed as much as possible.

2.2 DFG-Based Works

The works discussed in this section have used a Data Flow graph as their intermediate representation.


2.2.1 Scheduling

Paulin and Knight [85] introduced the force-directed scheduling (FDS) that uses a global selection criterion to choose the next operation for scheduling. Their FDS algorithm relied on the ASAP and ALAP scheduling algorithms to determine the range of control steps for every operation. The algorithm achieves its objective of reducing the number of Functional Units by uniformly distributing the operations of the same type into all the available control steps.
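The role of the ASAP and ALAP schedules in force-directed scheduling can be illustrated with a minimal sketch. It assumes unit-delay operations and a dependency map `preds`; the function names and the data layout are hypothetical and are not taken from [85]. The distribution graph sums, for each operation type and control step, the probability that an operation lands in that step, assuming it is equally likely to be placed anywhere in its ASAP-to-ALAP time frame.

```python
from collections import defaultdict


def asap(ops, preds):
    """Earliest control step (1-based) of each unit-delay operation."""
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in preds.get(op, [])), default=0)
        return step[op]

    for op in ops:
        visit(op)
    return step


def alap(ops, preds, latency):
    """Latest control step of each operation for a given latency bound."""
    succs = defaultdict(list)
    for op, op_preds in preds.items():
        for p in op_preds:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]

    for op in ops:
        visit(op)
    return step


def distribution_graph(ops, kind, early, late):
    """Expected number of operations of each kind in each control step."""
    dg = defaultdict(float)
    for op in ops:
        width = late[op] - early[op] + 1  # size of the time frame
        for s in range(early[op], late[op] + 1):
            dg[(kind[op], s)] += 1.0 / width
    return dg
```

An operation with a wide time frame contributes a small amount to many steps, so a scheduler that balances the distribution graph tends to move such operations away from crowded control steps.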

The HAL system, which is based on their force-directed scheduling approach, performs behavioral synthesis on a global scheme with step-wise refinement. Some of the constraints and features supported by the FDS algorithm include:

multicycle and chained operations.

mutually exclusive operations.

scheduling with fixed global timing constraints, aimed at minimizing functional unit costs, register costs and global interconnect requirements.

scheduling with local timing constraints.

scheduling with fixed resource constraints.

functional pipelining.

structural pipelining.

The FDS scheme does not take into account the future scheduling of operations into the same control step. This leads to a lack of compromise between early and late decisions and may result in a sub-optimal solution. Park et al. [81] overcome this weakness by iteratively rescheduling some of the operations in the given schedule. An initial solution is obtained using a standard algorithm, and that solution is improved by rescheduling a sequence of operations until no further improvement is attainable. The COMET system [20] applies the concept of force-directed scheduling in interaction with cluster structure information. Their system is based on a tool called Cluster Oriented Scheduling (COS), which uses pattern matching techniques to recognize the cluster structures of a new algorithm as an instance of a dependency structure for mapping an algorithm to an architecture.

In [38], Gupta et al. present a latency-constrained scheduling algorithm to optimize a design for dynamic power. Their work is motivated by the force directed scheduling algorithm proposed by Paulin and Knight [84]. Their algorithm reduces dynamic power by reducing switched capacitance inside resources, after evaluating the switched capacitance of combinations among DFG operations that could share resources. A force is associated with each feasible combination corresponding to the power consumption, and a distribution of such forces is obtained, whose mean, standard deviation and skew are used to produce a power-optimized schedule.

Rim and Jain [89] demonstrate a performance estimation tool that computes a lower bound on the completion time for the non-pipelined resource-constrained scheduling problem, given a data-flow graph, a set of resources, the resource delays and a clock cycle. Chaudhuri and Walker [18] produced an algorithm for computing lower bounds on the number of functional units of each type required to schedule a data-flow graph in a specified number of control steps. The bounds are found by relaxing either the precedence constraints or the integrity constraints on the scheduling problem.

A list-based scheduling algorithm that uses information from a DFG to guide its search for optimal or near-optimal schedules is presented in [100]. A DFG analysis is performed initially, which includes finding the successors and predecessors of every node and the tree to which the node belongs. With this knowledge, the scheduler is expected to make the best choice of the operation to be scheduled next.

The most basic constructive approaches for HLS, the ASAP and ALAP algorithms, assign no priority to operations, while the list scheduling approaches use a global criterion for selecting the next operation to be scheduled. Pangrle et al. [12] used the mobility of an operation as its global priority function, where mobility is defined as the difference between the ASAP and ALAP values of that operation. Another priority function, named urgency, was used by Girczyc et al. in [33]; it is defined as the minimum number of control steps from the bottom at which an operation can be scheduled before a timing constraint is violated. The list of ready operations is ordered according to these priority functions and processed for each state.
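As an illustration of how a priority function drives list scheduling, the following is a minimal sketch that orders the ready operations by mobility. It assumes unit-delay operations, at least one functional unit of every type, and mobility values computed beforehand; the function and parameter names are hypothetical and do not reproduce any of the cited systems.

```python
def list_schedule(ops, preds, kind, mobility, units):
    """Resource-constrained list scheduling of unit-delay operations.

    units maps an operation kind to the number of functional units of that
    kind available in every control step (assumed to be at least 1).
    """
    schedule = {}
    step = 0
    while len(schedule) < len(ops):
        step += 1
        # Ready operations: unscheduled, with all predecessors finished
        # in an earlier control step.
        ready = [
            op for op in ops
            if op not in schedule
            and all(schedule.get(p, step) < step for p in preds.get(op, []))
        ]
        # Least mobility first: these operations have the least freedom.
        ready.sort(key=lambda op: mobility[op])
        used = {}
        for op in ready:
            k = kind[op]
            if used.get(k, 0) < units[k]:
                schedule[op] = step
                used[k] = used.get(k, 0) + 1
    return schedule
```

Operations that do not fit into the current step simply remain unscheduled and compete again in the next step, which is where the choice of priority function affects the final schedule length.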


Other scheduling approaches were proposed to address the problems of memory and storage.

Kim and Liu [52] laid emphasis on minimizing the interconnection and then tried to group the variables to form memory modules. Lee and Hwang [65] proposed taking multiport memory into account as early as during scheduling. A multiport access variable (MAV) was defined for a control step, and the MAVs across all the control steps were equalized in order to achieve a better memory utilization. Aloqeely and Chen [5] proposed a sequencer-based architecture, where a sequencer is a stack or queue connecting one functional unit to another. High quality datapaths could be synthesized for many signal processing and matrix computation algorithms by letting the variables either stay in or flow through the sequencers for future use.

Achatz [1] proposed an extension to the ILP formulation so that it can handle multifunctional units as well as units with different execution times for different instances of the same operation type. Wang and Grainger [109] came up with a method to reduce the number of constraints in the original ILP formulation without reducing the explored design space, thereby making the computation more efficient and more applicable to larger-sized problems. Chaudhuri et al. [19] described a well-designed ILP formulation for exploiting the structure of the assignment, timing, and resource constraints, and they further improved the well-structured formulation by adding new valid inequalities. Landwehr et al. proposed the OSCAR system in [63], which represents a 0/1 integer programming model for solving the three tasks of HLS. In [32], Gebotys proposed an integer programming model for the synthesis of multi-chip architectures which can simultaneously deal with partitioning, scheduling, and allocation. Wilson et al. [110] generalized the ILP approach in an integrated solution to scheduling, allocation and binding in datapath synthesis.

Ly et al. [68] proposed a method for using behavioral templates for scheduling, where each template locks a number of operations into a relative schedule with respect to one another. It eases the handling of timing constraints, sequential operation modeling, prechaining of certain operations, and hierarchical scheduling. Unaltuna et al. presented a three-phase neural network based scheduling algorithm in [105], while Kawaguchi and Tokada combined simulated annealing with neural networks in [50] for solving the scheduling problem.

Dhodhi et al. [28] proposed the application of a problem-space genetic algorithm for datapath synthesis that performs concurrent scheduling and allocation with the objective of minimizing the resource cost and the total execution time. Ly et al. [69] adapted the simulated annealing procedure to high-level synthesis; their method explores the design space by repeatedly ripping up parts of a design in a probabilistic manner and reconstructing them using application-specific heuristics. Sharma et al. [96] combined the allocation and scheduling of functional, storage and interconnect units into a single phase, using the concept of register state (free, busy or undecided) for optimizing registers in an incomplete schedule where the lifetimes of variables are not yet available.

Langevin and Cerny [64] described a recursive method for estimating a lower bound on the performance of schedules under resource constraints for acyclic finite data-flow graphs. The recursive method is based on the greedy lower-bound estimator of Rim and Jain [89], which was formulated as solving a relaxation of the general scheduling problem and allowed for chaining of operations as well as pipelined and multicycle operations.

Kollig and Al-Hashimi [54] described a new simulated annealing-based algorithm capable of solving scheduling, allocation and binding tasks simultaneously without the need for independent interconnect optimization. Their algorithm begins with an initial solution, and proceeds by generating new solutions which are either accepted or rejected based on an acceptance criterion defined in the algorithm. The probability of accepting solutions with increasing cost depends on a cost parameter, which is gradually lowered as the annealing process proceeds [25]. The moves applied in the annealing procedure are chosen so as to cover scheduling, allocation and binding tasks simultaneously. The four different kinds of moves that could be applied to a randomly chosen operation at any time are:

Randomly schedule the operation one control step earlier or later.

Bind the value to a new register from the set of available registers.

Bind the operation to a new functional module from a set of available modules.

Swap the inputs of the operation if it is commutative.
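The accept/reject mechanism described above can be sketched generically. The snippet below is a minimal sketch that assumes the standard Metropolis criterion with a geometric cooling schedule; the exact criterion, cooling schedule and cost function of [54] are not given here, and all names and default values are hypothetical. The `neighbor` callback stands in for the four move types listed above.

```python
import math
import random


def accept(delta_cost, temperature, rng):
    """Always accept improvements; accept a cost increase with
    probability exp(-delta_cost / temperature) (Metropolis criterion)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


def anneal(initial, cost, neighbor, t_start=10.0, t_end=0.01,
           alpha=0.9, moves_per_temp=50, seed=0):
    """Generic annealing loop; neighbor(solution, rng) applies one move."""
    rng = random.Random(seed)
    current = best = initial
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            if accept(cost(candidate) - cost(current), temperature, rng):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        temperature *= alpha  # gradually lower the control parameter
    return best
```

At high temperature almost every move is accepted, which lets the search leave local minima; as the temperature falls, the procedure behaves increasingly like a greedy descent.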

Zhu and Gajski [113] established a theoretical framework for another concept of scheduling called soft scheduling. In soft scheduling, the decisions made are soft, i.e., they could be adjusted later. The authors discuss the applicability of soft scheduling to alleviate the phase coupling problem of HLS.


Shantnawi et al. [4] presented a novel technique to obtain a rate-optimal and processor-optimal schedule for a fully-static data flow graph onto a multiprocessor system. The authors employ the Floyd-Warshall shortest path algorithm to evaluate the relative firing times of the nodes of the DFG.

2.2.2 Allocation and Binding

Thepayasuwan et al. [102] proposed a novel technique for resource binding and operation scheduling to maximize the latency of the digital hardware such that its simultaneous switching noise is kept within feasible limits. The technique involves the automatic generation of performance models for each input specification and then applying an exploration algorithm to find the best resource binding and operation scheduling alternative.

Tseng and Siewiorek [104] divided the allocation problem into three tasks of storage, functional-unit, and interconnection allocation, which are solved independently by mapping each task to the popular clique-partitioning problem of graphs. In the graph formulation, operations, values, or interconnections are represented by nodes. An edge between two nodes indicates that those two nodes can share the same hardware. The allocation problem is thus transformed to the problem of finding the minimal number of cliques in the graph. Since the problem of finding the minimal number of cliques in a graph is an NP-hard problem, Tseng and Siewiorek adopted a heuristic approach to tackle it.
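The graph formulation can be made concrete with a small sketch. The code below is a simple first-fit greedy heuristic, not the heuristic of Tseng and Siewiorek; it only shows how a compatibility graph is turned into groups of nodes that share one hardware unit. The function name and the edge encoding are assumptions made for illustration.

```python
def greedy_clique_partition(nodes, compatible):
    """Partition nodes into cliques of the compatibility graph.

    compatible is a set of frozenset({u, v}) pairs; an entry means that
    u and v may share the same hardware unit.
    """
    cliques = []
    for node in nodes:
        for clique in cliques:
            # A node may join a clique only if it is compatible with
            # every member already in it.
            if all(frozenset((node, m)) in compatible for m in clique):
                clique.append(node)
                break
        else:
            cliques.append([node])  # start a new hardware unit
    return cliques
```

Each resulting clique corresponds to one shared resource, so the number of cliques is the number of units allocated. Because the heuristic is greedy, the result depends on the order in which the nodes are visited and is not guaranteed to be minimal.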

The clique-partitioning problem can minimize the storage requirements when applied to storage allocation. However, it totally ignores the interdependence between storage and interconnection allocation. Paulin and Knight [85] extend this approach by augmenting the graph edges with weights that reflect the impact on interconnection complexity due to register sharing among variables.

Hitchcock et al. [41] proposed an allocation system, named EMUCS, that starts with an empty datapath and builds it gradually by adding functional, storage and interconnection units as necessary. A similar approach was used in the MABAL system of Kucukacakar and Parker [59], which uses a global criterion based on functional, storage and interconnection costs to determine the next element to assign and where to assign it.


Kurdahi and Parker [60] used the left-edge algorithm to solve the register-allocation problem. The left-edge algorithm has the advantage of a polynomial time complexity when compared to the clique-partitioning approach, which is NP-complete. While the left-edge algorithm can successfully allocate the minimum number of registers, it fails to consider the impact of register allocation on the interconnection cost, which could be handled by a weighted version of the clique-partitioning algorithm.
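A minimal sketch of left-edge register allocation is given below. It assumes that every variable lifetime is a half-open interval [start, end) in control steps, so that a register released at step t can be reused by a variable born at step t; the function name and data layout are hypothetical.

```python
def left_edge(lifetimes):
    """Assign variables to registers with the left-edge algorithm.

    lifetimes maps a variable name to (start, end), read as [start, end).
    Returns a list of registers, each a list of the variables it holds.
    """
    registers = []  # variables stored in each register, in time order
    free_at = []    # control step at which each register becomes free
    # Process the variables in order of their left edge (start time).
    ordered = sorted(lifetimes.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    for var, (start, end) in ordered:
        for r in range(len(registers)):
            if free_at[r] <= start:  # lifetime does not overlap
                registers[r].append(var)
                free_at[r] = end
                break
        else:
            registers.append([var])
            free_at.append(end)
    return registers
```

For lifetimes without loops the number of registers produced equals the maximum number of simultaneously live variables. The sketch ignores interconnection cost entirely, which is the limitation noted above.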

The register and functional-unit allocation problems have been transformed into weighted bipartite-matching problems in [43]. The authors use a polynomial time maximum weight matching algorithm that allocates a minimum number of registers and also partially takes into consideration the impact of register allocation on interconnection allocation. Kumar and Bayoumi [3] considered the binding of functional units operating at multiple voltages. Their work is aimed at minimizing the power consumption due to switching activities on the physical components. They transformed the problem of binding into a graph-theory problem which was later solved using two approaches: a greedy approach and an optimal approach. Shiue et al. [98] presented a novel approach to low power binding in high-level synthesis based on linear programming methods. The binding problem was mapped onto a graph called the parallel graph (PG), upon which the linear-programming techniques were applied to search all paths to find the optimal binding that minimizes the overall power consumption due to switching activity. A datapath synthesized by constructive or decomposition methods could be further improved by an iterative refinement approach, named reallocation. Tsay et al. [103] propose the application of a sophisticated branch-and-bound method by reallocating a group of different types of entities for datapath refinement.

In [2], the authors explored the potential of a precision-sensitive approach for the high-level synthesis of multi-precision DFGs. They focus on fixed-latency implementations of the DFGs. They present register allocation, functional unit binding and scheduling algorithms that exploit the multi-precision nature of the DFGs for optimizing the area. An iterative improvement approach is developed, with the cost function formulated in terms of the number of bits of the arithmetic operators and storage units.

Dasgupta and Karri [24] proposed algorithms for scheduling and binding to minimize data bus transitions. The algorithm was based on a simulated annealing process. Hong and Kim [42] proposed the repeated application of the computation of maximum flow of minimum cost in networks for low power bus optimization during scheduling and binding. Chang and Pedram [16] proposed a new technique for reducing power consumption through register allocation and binding. The problem, in their algorithm, was formulated as a minimum cost clique covering problem, and solved for optimality using a max-cost flow algorithm. The same authors proposed an approach to power reduction in [17] for the binding of functional units. The problem, in this case, was formulated as a max-cost multi-commodity flow problem and solved for optimality. Since the multi-commodity problem is NP-hard, the functional unit binding problem domain was restricted to functionally pipelined designs with shorter latency. The approaches provided in [16] and [17] have the drawback that their application is limited to a number of specific small-sized low-power problems.

2.3 CDFG-Based Works

2.3.1 Scheduling

The works described in preceding sections have considered only blocks of straight-line code.

However, in addition to blocks of straight-line code, a realistic design description usually contains both conditional and loop constructs.

The problem of scheduling for control-dominated applications is considered in [111] and [58].

Several scheduling techniques based on a control flow graph (CFG) model are presented in [13] [11] [9] [31]. The CFG is basically a graphical description of a sequential program based implementation of the functionality. Even though such a CFG model could be comfortably used to capture the execution of instructions on a general-purpose uniprocessor, its application in exploiting the parallelism inherent in typical control flow intensive applications is limited.

Kim et al. [53] have proposed some techniques to schedule conditional constructs. In [106], a conditional vector is used to identify mutually exclusive operations so that an operation can be scheduled in different control steps for different execution instances. Camposano et al. [13] proposed a path-based approach, called As Fast As Possible (AFAP) scheduling, which first extracts all possible execution paths from a given behavior and schedules them independently. The schedules for the different paths are then combined by resolving conflicts among the execution paths. Similarly, different approaches have been proposed for handling loop constructs, like the pipelining method described in [82] and loop folding [33].
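The first step of a path-based approach, extracting all execution paths, can be sketched as follows. This is only an illustration of path enumeration on an acyclic control flow graph; it does not cover the independent scheduling of the paths or the conflict resolution performed by AFAP scheduling, and the function name and graph encoding are hypothetical.

```python
def execution_paths(succs, entry):
    """Enumerate all entry-to-exit paths of an acyclic control flow graph.

    succs maps a node to the list of its successor nodes; a node without
    successors is treated as an exit.
    """
    if not succs.get(entry):
        return [[entry]]
    return [
        [entry] + rest
        for nxt in succs[entry]
        for rest in execution_paths(succs, nxt)
    ]
```

For an if/else construct the function returns two paths, one through each branch. The number of paths can grow exponentially with the number of nested or consecutive conditionals, which is a known cost of path-based methods.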


Some graph representations that combine control and data flow into a single graph are presented in [47][29][56]. In [47], the control dependencies are expressed, but only the data dependencies are taken into account, while the control flow is not exploited. Here, the loops are represented using a branch node at the beginning of the loop and a merge node at the end. Such a branch-merge loop construct is developed for each variable in the loop, thus resulting in a complex subgraph with a high degree of redundancy of branch and merge nodes. A similar representation is used in [29].

The Sprite Input Language described in [56] uses a single signal flow graph and is more confined to DSP applications.

Amellal and Kaminska [6] presented a control and data flow graph (CDFG) model for system representation which includes a new representation of conditional branches. They developed a mutual exclusion testing procedure that provides for optimized resource sharing and critical path reduction. Some of the salient features of their CDFG representation include:

With the representation of the data and control flows in the same graph, both the datapath and the controller can be synthesized from that same graph.

The CDFG is an optimized representation of the control and data flows without any redundant dependency representations.

Their CDFG representation of the behavioral descriptions does not impose any restrictions on the scheduling tasks, thereby, resulting in a better exploration of the design space.

A branch numbering scheme is developed to solve the problem of resource sharing among mutually exclusive operations of the CDFG.

Apart from the single-graph model, the authors have formulated a new mathematical approach for the scheduling problem based on penalty weights. They used the Tabu Search technique, which has been effective in finding optimal solutions for many types of large and difficult combinatorial optimization problems. The authors claim that the fast and intelligent solution space exploration provided by the Tabu technique makes their scheduling algorithm quite powerful.

The Wavesched scheduling algorithm presented in [61] uses a CDFG model that preserves the parallelism inherent in the application. This algorithm is aimed at minimizing the average execution time of control-flow intensive behavioral descriptions. With its ability to overlap the schedules of independent iterative constructs, the bodies of which share resources, the Wavesched algorithm could explore previously unexplored regions of the solution space. A general loop handling technique was developed to incorporate other optimization techniques like loop unrolling. Also, the algorithm can support multi-cycled and pipelined functional units and can use chaining to enhance the cycle time utilization.

Potkonjak and Srivastava [86] introduced a transformation, named rephasing, to manipulate the timing parameters in a CDFG during the high-level synthesis of data-path intensive applications.

They use the rephasing approach to manipulate the values of the phases (or the relative times when corresponding samples are available at input and delay nodes) as an algorithmic transformation before the scheduling/allocation stage. They have shown that phase values can be chosen to transform and optimize the algorithms for factors like area, throughput, latency and power. In effect, the authors presented a technique for behavioral optimization through the manipulation of timing constraints.

Sentieys et al. [94] presented an architectural synthesis tool dedicated to DSP applications, in which, synthesis is achieved under time as well as silicon cost constraints. The algorithm is described in VHDL behavioral language, from which a CDFG is obtained and synthesized into processing, control, memory and communication units. The specifications of the designs in VHDL allowed for the interconnection with CAD and simulation tools. Kim et al. present a verification method for VHDL behavioral level design in [51]. To identify coding errors that a compiler cannot detect, the VHDL code is converted into a CDFG and verification patterns are applied on the CDFG.

They have also proposed other algorithms, such as backward training and forward training algorithms, to actuate a coding error and propagate it.

In an attempt to bridge the gaps between high-level and logic synthesis, Bergamaschi [8] presented a novel internal model, called the Behavioral Network Graph (BNG), that represents both data and control constructs, for synthesis that covers the domains of both high-level and logic synthesis. This model is an RT-level network capable of representing all possible schedules that a given behavior may assume. It allows high-level synthesis algorithms to be formulated as logic transformations and effectively overlapped with logic synthesis. The author has also addressed the problem of a lack of formal representation to be used by different algorithms and systems, which makes the sharing of benchmark examples difficult.


Kollig et al. [54] described a new simulated annealing-based algorithm capable of solving all HLS tasks concurrently without the need for independent interconnect optimization. The HLS problem is formulated with a schedule time and a module binding for each operation, a register binding for generated values and a Boolean variable indicating whether the inputs of commutative operations are to be swapped. This formulation is subject to constraints including data dependencies, execution time and module availability. Starting with an initial solution, new solutions are generated and either accepted or rejected depending on the acceptance criterion defined in the simulated annealing algorithm. The probability of accepting solutions with increasing cost depends on a control parameter, which is gradually decreased while the annealing process proceeds.

In GEM (Geometric Algorithm for Scheduling), proposed by Raje and Sarrafzadeh [88], a critical path based approach is used for scheduling operations. The algorithm uses weighted geometric point dominance matching from the operations onto the control steps. It starts with a CDFG and converts it into directed acyclic graphs by breaking the loops and removing the feedback edges while maintaining the conditions for the loops. Such an algorithm is performed in O(np log p) steps, where n is the number of disjoint paths in the CDFG and p is the number of nodes in the longest path. The problem of scheduling a path is transformed into the problem of obtaining a matching from a set of points OP_i, which represent the various operations in a path, onto a set of points C_i, which represent the control steps. Certain constraints are imposed on the matching owing to the data dependencies of the operations to be matched. The dependencies are usually described as precedence relations.

Wang et al. [108] described a comprehensive high-level synthesis system for control-flow intensive as well as data-dominated behaviors. Their algorithm is based on an iterative improvement strategy and performs clock selection, scheduling, module selection, resource allocation and assignment simultaneously. It also considers the interactions between these tasks in order to benefit fully from design space exploration at the behavior level. Their scheduling algorithm supports concurrent loop optimization and multicycling under resource constraints. The authors use a variation of the general CDFG model for their synthesis system, where their CDFG model includes some additional nodes to represent the start of a loop, etc. However, their system assumes that the initial CDFG representation of the application is available.


2.3.2 Allocation/Binding

In [30], Elgamel et al. present a novel approach for utilizing genetic algorithms to solve high-level synthesis tasks with multiple voltages. They have incorporated a new way of modeling and encoding the resulting chromosomes. Their system takes as its inputs a CDFG, a hardware library, and the time and area constraints. With this information, the algorithm solves the tasks of scheduling, allocation and binding simultaneously in order to generate a solution optimized for average and peak power. The evaluation function is formulated so as to consider the average and peak power consumptions while satisfying the given constraints.

Choi and Kim [21] proposed an efficient binding algorithm for power optimization in high-level synthesis. The authors claim that the traditional approach of formulating the binding problem as a multi-commodity flow problem is limited to a class of small-sized problems owing to the NP-hard nature of the multi-commodity flow problem. They have developed a new technique that uses the property of efficient flow computations in a network so that it is extensively applicable to practical designs while producing near-optimal results. They propose a heuristic algorithm, named BIND-lp, that finds a feasible binding by utilizing the flow computation steps and later refining them incrementally.

The application of increased parallelism to allow voltage reduction for the same computational throughput is depicted in [15]. This work led to several other works like [74], which uses slack to avoid unnecessary computations, and [5] which shows how a DFG might be partitioned for multiple voltages. In [7], the voltage idea was combined with an iterative improvement approach using a square switching matrix as the basis for a signal correlation matrix.

Crenshaw et al. proposed several heuristics in [22] to investigate the problem of exploiting signal correlation between operations to find a schedule and binding which minimizes switching.

They describe an algorithm for scheduling communications on a bus, which reduces bus switching by up to 60% without any increase in the number of cycles required for the schedule. Their technique of capturing signal correlation information during behavioral simulation can be applied in addition to popular voltage reduction methods. The switching information thus obtained through simulations is stored in the form of a switching table. A cubic table represents the switching activity for conditional nodes by including data for switching from a node i to a node i+j, where j ≥ 1. In the absence of conditional nodes, all the data represents switching between node i and node i+1, for all i. Thus, their method could be applied to graphs with both conditional and data nodes.

Mehra et al. proposed the partitioning of CDFG into groups with minimized inter-group communication in order to reduce the switched capacitance.

Zhong et al. [112] presented a general sufficient condition for register binding to ensure that a given set of functional units is perfectly power managed, i.e., does not contain any spurious switching activity. Their method is applicable to both data-flow intensive and control-flow intensive behaviors and leads to a straightforward power-managed register binding algorithm. The authors claim that their algorithm, being independent of the functional unit binding and scheduling steps, could be easily incorporated into existing high-level synthesis systems. In [27], a technique is proposed to redesign the control logic to configure existing multiplexer networks to minimize spurious switching activity. A register binding algorithm which guarantees an RTL circuit for control and data-flow intensive behaviors that is free of spurious switching activity is presented in [62].

2.3.3 Our Work

The work presented here describes a High-Level Synthesis tool that could be used to solve each of the aforementioned tasks. This tool is targeted at both control-flow intensive and data-flow intensive behaviors, and incorporates several additional features like optimization for resources and power consumption. Our system uses a single graph representation of control and data flow dependencies as its intermediate form, and also incorporates a tool for transforming the original VHDL description into the corresponding CDFG.


CHAPTER 3

CDFG EXTRACTION FROM VHDL

The first, and often ignored, step in the high-level synthesis (HLS) process is the conversion of the original behavioral description in VHDL into an intermediate representation that captures the details of the description in a form suitable for the next steps of HLS. Some of the most commonly used intermediate representations include the Data Flow Graph (DFG), the Control Flow Graph (CFG), and the Control and Data Flow Graph (CDFG). While the DFG is the most prominent form of representation, the CDFG has the advantage of depicting both control and data constructs in a single graph, giving a better state-space exploration capability for the later steps. In this chapter, we describe a conversion tool that extracts such a CDFG from a given behavioral VHDL code.

This tool is based on compiler-like transformations and other behavior-preserving transformations.

3.1 Introduction

Today, Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) has emerged as the default hardware description language. VHDL is intended to describe hardware. This description can be fed to a simulator, which simulates the behavior of the hardware modeled in the VHDL description. If the description is correct, the simulation exhibits the same behavior as the hardware.

Initially, VHDL was intended for use as an input to the VHDL simulator and not for synthesis.

The simulation of the code was done to accurately predict the voltage values on all nets at any time and verify timing relationships between changes on these nets. On the contrary, the goal of the synthesis is to implement the given behavior by interconnecting components from a given library.

Hence, simulation deals with timing while synthesis deals with connectivity. For a simulation to generate the correct models for the behavior of nets of interest, the description driving those nets need not be minimal as long as it produces the correct behavior. Therefore, different descriptions for the same functionality are likely to synthesize into designs of varying quality.

VHDL models the hardware as concurrently running processes. Each process contains an algorithmic description of the process’ behavior. Thus, modeling hardware in a single process results in a purely behavioral description at the highest level of abstraction. However, as soon as a VHDL description contains more than one process, it defines some structure within the hardware. This structure implicitly adds to the model’s behavior. It also lowers the level of abstraction, which can come close to the hardware if each process only has the functionality of a gate. VHDL modeling is analogous to the concept of a container. The values within these containers change with time as they are affected by varied input stimuli. Such a model is suitable for both synchronous and asynchronous behavior.

The process of synthesizing an RTL structure from the functional description during high-level synthesis involves three phases: the timing of computational operations (scheduling), determining the number of instances of each resource needed (allocation), and the assignment of resources to computational operations (binding). Behavioral synthesis requires transformation of the VHDL code into an internal representation which extracts registers, combinational logic equations and macros like '+', '*', etc., for the scheduling, allocation and binding processes. Most systems use the control flow graph, the data flow graph, or a combination of the two such as the CDFG as their intermediate format.

A CDFG is expected to capture all the control and data flow information of the original VHDL description while preserving the various dependencies [93]. This CDFG undergoes incremental refinement as it passes through various stages of a high-level synthesis system to finally yield a register transfer level representation of the behavioral specification. Scheduling partitions the set of arithmetic and logical operations in the CDFG into groups of operations so that the operations in the same group can be executed concurrently, while trying to minimize the total execution time and/or the hardware cost.
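The three phases can be illustrated on a small data flow graph. The following Python sketch is only an illustration under stated assumptions: the graph `DFG`, the operation names and the unit-delay model are hypothetical, and as-soon-as-possible (ASAP) scheduling is used here purely because it is the simplest scheduler to show, not because it is the method adopted in this work.

```python
from collections import defaultdict

# Hypothetical data flow graph: operation -> (operation type, predecessors).
# It models x = (a * b) + (c * d); y = x - e, with unit-delay operations.
DFG = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "a1": ("add", ["m1", "m2"]),
    "s1": ("sub", ["a1"]),
}

def asap_schedule(dfg):
    """Scheduling: place each operation in the earliest step after its predecessors."""
    step = {}
    def visit(op):
        if op not in step:
            preds = dfg[op][1]
            step[op] = 1 + max((visit(p) for p in preds), default=0)
        return step[op]
    for op in dfg:
        visit(op)
    return step

def allocate(dfg, step):
    """Allocation: units needed per type is the peak usage in any single step."""
    usage = defaultdict(int)
    for op, s in step.items():
        usage[(dfg[op][0], s)] += 1
    need = defaultdict(int)
    for (kind, _), count in usage.items():
        need[kind] = max(need[kind], count)
    return dict(need)

def bind(dfg, step):
    """Binding: assign each operation a unit instance, reusing units across steps."""
    next_free = defaultdict(int)
    binding = {}
    for op in sorted(dfg, key=lambda o: (step[o], o)):
        kind, s = dfg[op][0], step[op]
        binding[op] = f"{kind}{next_free[(kind, s)]}"
        next_free[(kind, s)] += 1
    return binding

schedule = asap_schedule(DFG)
units = allocate(DFG, schedule)
binding = bind(DFG, schedule)
print(schedule)
print(units)
print(binding)
```

In this sketch the two multiplications share the first control step and therefore require two multipliers, while the addition and the subtraction occupy the second and third steps with one unit each.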

Researchers working on the high-level synthesis problem begin with the assumption that a flow-graph representation of the behavioral description is available. Here, an often overlooked but significant step is the conversion of the VHDL code into a CDFG. This step, normally carried out manually prior to high-level synthesis, may take from a few minutes to a few days depending on the size and the complexity of the VHDL code. Also, the chances of errors in the resulting CDFG increase with the complexity of the code, and since this is one of the initial steps, such errors would drastically affect the accuracy of the whole design flow. To counter this problem, we have developed a new conversion tool that automates the process of obtaining a CDFG from VHDL code, thereby reducing the design time significantly and providing a useful aid to the developer. The task of such a VHDL to CDFG compiler is to extract a fully behavioral description to be fed to an architectural synthesis system, which then adds structural information to that description.
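To give a feel for what automating this step involves, the following Python sketch extracts the data flow of a single VHDL variable assignment. It is not the conversion tool described in this thesis: the function `extract_dfg` and the node names `n0`, `n1` are hypothetical, and the sketch borrows Python's own parser in place of a VHDL front end, which works only under the assumption that the statement contains nothing but names, integer literals and the binary operators +, - and *, whose relative precedence is the same in both languages.

```python
import ast

# Python AST operator classes mapped to the operator symbols they stand for.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def extract_dfg(vhdl_assignment):
    """Return (nodes, edges) describing the data flow of one assignment.

    Assumption: only names, integer literals and binary +, -, * appear,
    so Python's parser can stand in for a real VHDL parser.
    """
    text = vhdl_assignment.strip().rstrip(";").replace(":=", "=")
    stmt = ast.parse(text).body[0]
    nodes, edges = [], []

    def walk(expr):
        if isinstance(expr, ast.Name):
            return expr.id                    # value read from a variable
        if isinstance(expr, ast.Constant):
            return str(expr.value)            # literal operand
        if isinstance(expr, ast.BinOp):
            left, right = walk(expr.left), walk(expr.right)
            node = f"n{len(nodes)}"           # hypothetical node naming scheme
            nodes.append((node, OPS[type(expr.op)]))
            edges.append((left, node))
            edges.append((right, node))
            return node
        raise ValueError("unsupported construct")

    result = walk(stmt.value)
    edges.append((result, stmt.targets[0].id))  # final value flows to the target
    return nodes, edges

nodes, edges = extract_dfg("A := B * C + D;")
print(nodes)
print(edges)
```

Even for one statement, the operand order and the operator precedence must be tracked correctly, which suggests why manual conversion of larger descriptions is slow and error-prone.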

3.2 Preliminaries

In this section, we provide the basic definitions and concepts which form the basis for our conversion tool.

3.2.1 Control and Data Flow Graph

The Control and Data Flow Graph is a directed acyclic graph in which a node can be either an operation node or a control node (representing a branch, loop, etc.) [92]. The directed edges in a CDFG represent the transfer of a value or control from one node to another. An edge can be conditional, representing a condition that arises when implementing if/case statements or loop constructs.

Figure 3.1 shows a CDFG representation for the following VHDL code fragment.

A := B * C + D;
while (A > 0) loop
    A := A - 1;
end loop;
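One possible in-memory form of such a graph is sketched below in Python. The class names `Node`, `Edge` and `CDFG`, the node names and the edge kinds are assumptions made for illustration; they are not the data structures of the tool, and the layout need not match Figure 3.1. To stay consistent with the acyclic definition above, the loop is modeled as a control node with a conditional edge into its body rather than with a back edge.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    kind: str     # "operation" or "control"
    label: str    # operator symbol or control construct

@dataclass
class Edge:
    src: str
    dst: str
    kind: str                        # "data" or "control"
    condition: Optional[str] = None  # set only on conditional edges

@dataclass
class CDFG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, kind, label):
        self.nodes[name] = Node(name, kind, label)

    def add_edge(self, src, dst, kind, condition=None):
        # Both endpoints must already exist, so edges never dangle.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge {src} -> {dst}")
        self.edges.append(Edge(src, dst, kind, condition))

    def successors(self, name):
        return [e.dst for e in self.edges if e.src == name]

g = CDFG()
g.add_node("mul", "operation", "*")            # B * C
g.add_node("add", "operation", "+")            # (B * C) + D, assigned to A
g.add_node("cmp", "operation", "relational")   # loop test comparing A with 0
g.add_node("loop", "control", "while")         # loop control node
g.add_node("sub", "operation", "-")            # A - 1 in the loop body

g.add_edge("mul", "add", "data")
g.add_edge("add", "cmp", "data")               # A feeds the loop test
g.add_edge("cmp", "loop", "data")              # test result drives the loop
g.add_edge("loop", "sub", "control", condition="test true")
g.add_edge("add", "sub", "data")               # A feeds the decrement

print(g.successors("add"))
```

Keeping the condition on the edge rather than on the node mirrors the definition above, in which conditional edges carry the conditions of if/case statements and loop constructs.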

Nodes

In general, the nodes in a CDFG can be classified as one of the following types [92].

Operational nodes: These are responsible for arithmetic, logical or relational operations.

Call nodes: These nodes denote calls to subprogram modules.