Interval Arithmetic Logic Unit for DSP and Control Applications

GUPTE, RUCHIR. Interval Arithmetic Logic Unit for DSP and Control Applications.
(Under the direction of Prof. William W. Edmonson).

Many applications in digital signal processing (DSP) and control require the user
to know how numerical errors (uncertainty) affect the result. Interval Arithmetic
(IA) makes this uncertainty explicit by replacing non-interval values with intervals
that bound it. Since most DSPs operate in real-time environments, fast processors
are needed. The goal is to develop a platform on which interval arithmetic
operations are performed at the same computational speed as on present-day signal
processors.

This thesis proposes a design for an interval-based arithmetic logic unit (I-ALU)
whose computation time for interval arithmetic operations is equivalent to that of
many digital signal processors. Many DSP and control applications require a small
subset of arithmetic operations that must be computed efficiently. The design has
two independent modules operating in parallel to calculate the lower bound and the
upper bound of the output interval. The functional unit of the ALU performs the
basic fixed-point interval arithmetic operations of addition, subtraction and
multiplication, and the interval set operations of union and intersection. In
addition, the ALU is optimized to perform dot products through the
multiply-accumulate instruction. Division is traditionally not implemented on
digital signal processors unless it can be computed with a shift operation. In this
design, division by shifting is implemented.

One of the prime design goals is to maximize the throughput of the ALU for an
optimum value of area. Pipelining is implemented to achieve this goal. A power
dissipation analysis of the different ALU architectures is also carried out. Since
it is required to obtain maximum throughput for the least power dissipation,
throughput per unit power dissipation is used as the most critical performance
metric. This thesis studies several architectures for the ALU and concludes with
the one with the

by

Ruchir Gupte

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Master of Science

Electrical and Computer Engineering

Raleigh

2006

Approved By:

Dr. Winser E. Alexander Dr. William Rhett Davis

Dr. William W. Edmonson

Ruchir Gupte was born on December 9th, 1982 in Mumbai, India. He received

his Bachelor of Engineering (B.E.) degree in Electronics and Telecommunications

from the Mumbai University in June 2004. In Fall 2004, he began his graduate

studies in the Electrical and Computer Engineering Department at North Carolina

State University. Since Spring 2005, he has been working in the High Performance

Digital Signal Processing (HiPerDSP) Laboratory of Dr. Winser Alexander and Dr.

William Edmonson in the field of hardware support for interval analysis.

He worked at Sony Ericsson Mobile Communications Inc., Raleigh, as an intern

from May 2005 to August 2005. He has also taken keen interest in community

participation and has been a committee member of the NC State Indian Graduate

Student Association called MAITRI. Moreover, he has extended his support in

volunteering for various events organized by the Department of Electrical and Computer

Above all, I thank my parents for the much needed motivation throughout the

duration of my stay away from home. It was their love and support that helped

me maintain sanity during stressful times. My sister, Sheetal, has been a great

inspiration for me throughout.

I sincerely acknowledge the efforts of Dr. William Edmonson, my academic

advisor, in providing guidance and encouragement for the successful completion of this

thesis. Dr. Edmonson has made available all resources that I could possibly need

and also allowed the independence of applying my ideas in this project. I am deeply

indebted to him for his patience and invaluable suggestions during the course of this

project.

I am also grateful to the other members of my thesis committee, Dr. Winser

Alexander and Dr. Rhett Davis for devoting their time and providing useful inputs.

Completion of this work would not have been possible without their guidance.

I sincerely wish to express my gratitude to the HiPer DSP Research group for

creating an environment that has been fabulous for research and fun. Additional

thanks to Ramsey Hourani and Senanu Ocloo for their unconditional help

throughout my stay in the group. The encouragement and moral support extended by all

members of the group through good and hard times cannot be described in words.

Special thanks to Ravi Jenkal for his inputs and help.

Finally, I would like to thank those near and dear to me, without whose backing,

this thesis would have been a distant reality. I am grateful to all my friends at

Raleigh for being there, a special mention to my roommate, Karan Tewari, for his

List of Tables
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Background
  1.3 Contribution
  1.4 Thesis Organization
2 Interval Arithmetic
  2.1 Interval Arithmetic and Set Operations
3 Design Specifications of the ALU
  3.1 Fixed-point two’s complement arithmetic
    3.1.1 Representation of numbers
    3.1.2 Arithmetic Operations
  3.2 Outward/Directed Rounding
4 Hardware Architecture
  4.1 Overall Architecture
  4.2 Flag Generator Module
  4.3 Lower Bound and Upper Bound Modules
    4.3.1 Functional Units and Control Logic
    4.3.2 Special Case Multiplication Block
    4.3.3 Multiply-Accumulate Block
  4.4 Rounding Unit
  4.5 Pipeline Architecture of the Design
    4.5.1 Need for Pipelining
    4.5.3 Highly Pipelined Design
5 Testing and Results
  5.1 Simulation Results
  5.2 Synthesis Results
    5.2.1 Non-Pipelined Architecture
    5.2.2 Design with Pipeline Multipliers
    5.2.3 Highly Pipelined Design
  5.3 Power Analysis
    5.3.1 Generating Input Vectors
    5.3.2 Statistical Results from Power Scripts
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

List of Tables

2.1 Nine Cases in Multiplication
3.1 Two’s complement fixed-point representation
4.1 Description of ALU Inputs
4.2 Description of ALU outputs
4.3 mul flag for the Multiplication operation
4.4 Flag Generation
5.1 Timing Reports for the Non-Pipelined Architecture
5.2 Area Reports of Non-Pipelined Architecture
5.3 Timing Reports for the Pipelined Multipliers
5.4 Timing Reports for various Pipelined Architectures
5.5 Area Reports for various Pipelined Architectures
5.6 Timing Reports for Non-Pipelined and Highly-Pipelined Architectures
5.7 Results for Non-Pipelined and Highly-Pipelined Architectures
5.8 Area Reports for Non-Pipelined and Highly-Pipelined Architectures
5.9 Power Dissipation for Different Architectures with 500 Input Vectors
5.10 Power Dissipation for 3-stage Pipelined Architecture
5.11 Power Dissipation for All Stages with different Input Vectors
5.12 Throughput per unit Power Dissipation for All Architectures

List of Figures

3.1 Two’s complement number representation
4.1 Top Level Block Diagram of the ALU
4.2 Flag Generation Module
4.3 Block Diagram of Lower Bound Module
4.4 Block Diagram of Upper Bound Module
4.5 Lower Bound Module
4.6 Upper Bound Module
4.7 Special Case Multiplication
4.8 Multiply-Accumulate Module
4.9 Lower Bound Rounding
4.10 Upper Bound Rounding
4.11 Critical Path
4.12 Non-Pipelined Multiplier Architecture
4.13 Two-stage Pipelined Multiplier Architecture
4.14 Three-stage Pipelined Multiplier Architecture
4.15 Four-stage Pipelined Multiplier Architecture
4.16 Five-stage Pipelined Multiplier Architecture
4.17 Highly Pipelined Architecture
5.1 Simulation Results for Add, Subtract and Multiply
5.2 Simulation Results for Interval Union and Intersection
5.3 Timing Reports of Non-Pipelined Architecture
5.4 Area Reports for the Non-Pipelined Architecture
5.5 Timing Reports of Different Pipelined Architecture
5.6 Area Reports of Different Pipelined Architecture
5.7 Timing and Area Reports of Different Pipelined Architectures
5.11 Power Dissipation for different Input Vectors for 3-stage Pipelined Multiplier Design
5.12 Power Dissipation for different Input Vectors for All Architectures

Introduction

Interval Arithmetic (IA) performs computations on intervals of real numbers instead
of on real numbers themselves. It takes into account the numerical errors that
occur when arithmetic is performed on a computer. Such errors occur on all
computers that use finite-precision binary number systems, such as the one defined
by the IEEE 754 Standard for Binary Floating-Point Arithmetic [1]. As of now,
interval arithmetic is implemented in software. The GNU Fortran Compiler has been
modified to provide support for an interval data type [2], based on the Interval
Arithmetic Specification [3]. The SUN Studio Fortran95 compiler provides support
for interval operations [4]. The SUN studio has compilers and tools that support C
and C++ development as well [5]. The main disadvantage of software implementations
is their slow speed. They incur tremendous overhead from changing rounding modes,
function calls, exception handling, memory management, etc. For instance,
multiplication requires several conditional branches to determine which interval
end-points need to be multiplied. Based on the values of the input intervals
relative to zero, nine different cases of multiplication have to be accounted for
to select the end-points. A large number of conditional statements are required to
select between these multiplication cases.

time consuming steps. The performance penalty for misprediction of conditional
branches is quite heavy in fully pipelined processors. Changing rounding modes in
software also requires a large number of cycles. On many processors, changing the
rounding mode causes the entire floating-point pipeline to be flushed, which
results in a delay of several cycles and severely limits parallel execution.
Furthermore, software implementations of interval multiplication are typically
written as subroutines, which adds overhead for subroutine calls and returns.

Thus, interval algorithms end up running slower on current computer architectures
than their real arithmetic counterparts [6]. Software implementations are found to
be as much as four times slower than functionally equivalent hardware [7]. Hardware
support is required to overcome the performance drops caused by the software issues
described above.

Digital signal processing (DSP) applications involve a very large number of
arithmetic operations, and the need for accurate results makes reliable numerical
computation imperative. The goal of this design is to solve problems in the DSP
field with higher accuracy and at a faster rate. Since software implementations are
slow, it is necessary to build dedicated hardware in order to achieve this goal.
Interval methods are one of the solutions for managing the errors resulting from
numerical computations. This was the motivation behind building an Interval ALU
dedicated to DSP and Control applications.

Interval-based algorithms continue to find applications as solutions to signal
processing and control problems. For instance, in signal processing there is
usually a need to determine the optimal solution to a problem, i.e., to minimize a
cost function. The ability of interval global optimization approaches to guarantee
convergence to the global minimum point(s) makes such approaches attractive in DSP
and control applications. Having optimum hardware for global

mentioned earlier. DSP and control algorithms need to be designed in such a way
that the roundoff and truncation errors that occur naturally due to the discrete
nature of computing do not prevent the algorithm from converging to the global
minimum. Interval analysis provides a means of managing such errors. It is
therefore possible to obtain numerically reliable results. A reliable result may be
defined as one that is guaranteed to contain the exact result of the operation
being performed. Interval analysis gives as its output an interval that is certain
to contain the exact value of the result of an operation performed on two input
interval numbers. It is thus capable of providing reliable results.

These results can be achieved by using arithmetic logic units (ALUs) that are
specially designed to manipulate interval numbers. Such an Interval ALU (I-ALU) can
be used as the core of a digital signal processor. In contrast to general-purpose
microprocessors, which are designed to handle general computing tasks, digital
signal processors are designed and optimized for algorithms that are characterized
by repetitive multiply-and-add operations. In general, they feature fast
multiply-accumulate instructions, multiple-access memory, specialized program
control for interrupt handling and I/O, and fast and efficient access to
peripherals. We desire to achieve maximum efficiency while providing these features
that make up a good digital signal processor.

Throughput is the most important metric to analyze the performance of a DSP

system. The throughput problem will have to be solved for interval algorithms to

become more practical. The throughput of an I-ALU will have to be comparable to

that of non-interval units. Pipelining provides an effective solution to improving the

throughput of the design. By definition, pipelining is an implementation technique

where multiple instructions are overlapped in execution. Each stage completes a

part of an instruction in parallel. Pipelining does not decrease the time for individual

instruction execution. It increases instruction throughput, instead. Throughput of

depth of the pipeline adversely affects the area of the design. Hence an optimum

design would involve a proper trade-off between the throughput and area, where

the throughput would have more importance in signal processing applications.
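As an illustration of the statement that pipelining raises throughput without shortening the time taken by an individual instruction, the following sketch counts clock cycles for an ideal pipeline. It assumes that one new operation enters the pipeline every cycle and that there are no stalls. The function names, the 10 ns unpipelined delay and the five-stage split are hypothetical values chosen for the example, not results from this design.

```python
def total_cycles(num_ops: int, stages: int) -> int:
    """Cycles needed to complete num_ops operations on an ideal pipeline.

    Assumes one operation is issued per cycle and no stalls occur
    (an idealization, not a measurement of the I-ALU).
    """
    if num_ops <= 0:
        return 0
    # The first result appears after `stages` cycles; each further
    # operation completes one cycle after the previous one.
    return stages + num_ops - 1


def throughput(num_ops: int, stages: int, clock_ns: float) -> float:
    """Average number of operations completed per nanosecond."""
    return num_ops / (total_cycles(num_ops, stages) * clock_ns)


# Hypothetical numbers: a 10 ns unpipelined unit versus the same logic
# split into 5 stages of 2 ns each.
latency_unpipelined_ns = 1 * 10.0   # one operation, start to finish
latency_pipelined_ns = 5 * 2.0      # still 10 ns for one operation
rate_unpipelined = throughput(1000, stages=1, clock_ns=10.0)
rate_pipelined = throughput(1000, stages=5, clock_ns=2.0)
```

Under these assumptions the latency of a single operation is 10 ns in both cases, while the sustained rate over 1000 operations rises from 0.1 to roughly 0.5 operations per nanosecond. A real pipeline also pays for its extra registers in delay and area, which is the throughput versus area trade-off noted above.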

1.1 Motivation

Digital Signal Processing has become the choice for many applications related to
communications, control, multimedia, etc., because of the high performance it
achieves for applications that use a limited instruction set to implement
repetitive linear operations such as addition, multiplication and delay on a stream
of sampled data. Often a DSP is used as an attached coprocessor or combined with
one or more FPGA devices to meet the performance and cost requirements of a
particular application. Common to DSPs is the ability to perform
multiply-accumulate (MAC) operations in a single instruction cycle. This operation
is key to computing vector products, which in turn are key to computing Fourier
transforms and correlations. Other features of a DSP include the ability to access
multiple memories, dedicated address units for simultaneous access to data memories
and program memory, and modulo addressing. Several DSP manufacturers also include
specialized peripherals along with fast interrupt handling. Examples of these
specialized peripherals include analog-to-digital converters and I/O for
multiprocessor communications.

Underlying many of these applications is the need for accurate and reliable
results, but errors due to rounding, uncertainty in the data, quantization noise
and catastrophic cancellation in floating-point computations can lead to
inaccuracies. Sometimes these inaccuracies go unnoticed. In many signal processing
applications, operations are recursive and act on a sequence of data. This implies
that numerical errors can grow without bound over time. An efficient method for
monitoring and controlling these inaccuracies is to replace point arithmetic with
interval

Digital signal processors represent one of the fastest growing segments of the
embedded world. Despite their vast use, DSPs present difficult challenges for
programmers. Since computation speed is of critical importance to DSP applications,
DSPs focus on supporting fixed-point operations.

Fixed-point representation requires programmers not only to deal with
mathematically sophisticated equipment, but also to deal with the errors introduced
by reduced-precision arithmetic. Although it would be ideal to use floating-point
arithmetic rather than fixed-point, as a practical consideration, fixed-point
processors operate at a much faster rate than their floating-point counterparts.
Fixed-point DSPs execute in the gigahertz range; floating-point DSPs peak out in
the 300-400 megahertz range. Fixed-point DSPs enjoy the further advantage of being
consumed in large volumes, as a result of which their price per chip is a fraction
of the price of a floating-point DSP [8]. Fixed-point processors gain speed and
power efficiency over floating-point processors at the cost of reduced precision.
However, DSP applications rarely require the full dynamic range offered by a
floating-point number system. This justifies the choice of fixed-point arithmetic
over floating-point arithmetic for our ALU design.

As mentioned earlier, hardware solutions are needed over software implementation

to solve the speed problem. This brings up the idea of building an Arithmetic Logic

Unit dedicated to perform arithmetic operations on interval inputs in the fixed

point representation. The aim is to develop hardware that is optimized for interval

operations pertinent to signal processing applications such as addition, subtraction,

multiplication and multiply-accumulate.

1.2 Background

Interval algorithms have found their usage in applications such as global

solving differential equations [14], solving non-linear equations [15], etc. In
most of these cases, interval arithmetic is used to solve problems which cannot be
efficiently solved using conventional floating-point arithmetic.

Several software tools have been developed to support interval arithmetic. These
include interval arithmetic libraries in Fortran [16], [17], [18] and C++ [19],
extended scientific programming languages such as PASCAL [20], C++ [21] and Fortran
[22], [23], and interval-enhanced compilers [24]. In spite of these developments,
interval arithmetic has not gained popularity, owing to its speed disadvantage
compared to conventional floating-point methods. It is believed that the
performance of interval arithmetic needs to be within a factor of five of
floating-point arithmetic for it to gain general acceptance [25]. Hardware support
for interval arithmetic is required to achieve this level of performance. Several
interval-based hardware designs have been implemented, a few of which are listed
below:

• Hardware Interval Multipliers [26]

The author presents serial and parallel hardware units for interval multiplication.
These units provide automatic interval end-point selection and correct rounding of
results. While the serial interval multiplier uses a single multiplier unit, the
parallel multiplier uses dual multipliers to compute the interval end-points
simultaneously. These multipliers provide a significant performance boost for
acceptable increases in area.

• A Combined Interval and Floating Point Multiplier [27]

This design is based on the approach that an interval multiplier can share hardware
with an existing floating-point multiplier, thereby achieving the performance
benefits of an interval multiplier at relatively low cost. The design resorts to
software to handle the uncommon case of multiplication where both input intervals
contain zero. Interval multiplication requires only one more cycle than
floating-point multiplication, and is one to two orders of

• A Combined Interval and Floating Point Divider [28]

This design follows a similar approach, in which an existing floating-point divider
is modified to support interval division. Based on the values of the interval
inputs relative to zero, seven different cases of division are addressed. The
modification that enables interval division increases the area of the
floating-point divider by 24%.

• A Combined Interval and Floating Point Comparator [29]

This design is an implementation of a combined interval and floating-point
comparator, which performs interval intersection, hull, minimum, maximum and
comparisons, as well as floating-point minimum, maximum and comparisons. It has
around 98% more area than conventional floating-point comparators and a worst-case
delay that is 42% greater.

• Variable Precision Interval Arithmetic Processors [30]

The author presents designs, arithmetic algorithms and software support for

a family of variable precision, interval arithmetic processors. These processors

give the programmer the ability to detect, and if desired, to correct the implicit

errors in finite precision numerical computations. The processors are two to

three orders of magnitude faster than software packages that provide similar

functionality.

However, all of the above architectures are designed for the floating-point
representation of numbers. Although these are high-precision computational units,
they have lower throughput than their potential fixed-point counterparts. As
mentioned earlier, they also have higher design complexity and hence are
undesirable for DSP applications.

1.3 Contribution

The following thesis designs the hardware architecture of an I-ALU and optimizes

two’s complement representation of numbers. Two’s complement representation

is most convenient to perform arithmetic because of its uniformity over positive

and negative numbers while performing operations and rounding. Although

fixed-point arithmetic reduces the precision of results, the precision provided by it is

sufficient for DSP related applications. Besides, it has the added advantage of

reduced complexity of the design and higher throughput. A basic hardware model

has been built at the RTL level of abstraction, and the design has been modified

for better efficiency by use of pipelined multipliers of increasing depths. These

designs have been explored and statistical data, based on the results of simulations

and synthesis, has been used to determine the optimal solution. Throughput,

area, power dissipation and numerical reliability are the performance metrics used

for system evaluation.

1.4 Thesis Organization

Chapter 2 introduces the reader to the concept and conventional representation of
interval numbers. The arithmetic and set operations which the I-ALU can perform on
these numbers are discussed in detail in that chapter. Chapter 3 provides the
design specifications of the proposed ALU. The significance of using the two’s
complement representation of numbers is explained there, along with the details of
the rounding issue. Chapter 4 describes the hardware architecture of the ALU in
detail. A comprehensive description of each module constituting the ALU is given,
and the implementation of rounding is addressed. Chapter 5 provides the results of
simulation runs and synthesis performed on the different architectures of the
design. An exhaustive comparison of the results from the non-pipelined design and
various versions of the pipelined design is made to arrive at an optimal solution.
Throughput, area and power dissipation are used as the performance metrics.
Finally, I conclude my work with Chapter 6, providing the details of the future
work that can be done on the design to broaden its scope, improve functionality

Interval Arithmetic

In the words of Ramon E. Moore [31], “If we have, in addition to the results of

a computation, error bounds for the differences between the results and the exact

solution values, then no matter how these error values were obtained, by analytical

means or by further machine computations during or after the given computation,

it will always be the case that we have, in effect, for each exact result sought, a pair

of numbers: an approximate value and an error bound, or an upper and a lower

bound to the exact result.”

Real numbers can have infinite precision, whereas all machines are inherently of
finite precision. Because of this, real numbers are approximated to bring them into
machine-representable form. The error bound on this approximation may be considered
as an uncertainty. Interval Analysis is a means of representing this uncertainty by
replacing single point values by intervals. It provides a means of bounding the
errors that accrue due to the discrete nature of computing.

An interval number is defined to be an ordered pair of real numbers [a, b], such
that a ≤ b. Using the notation {x | P(x)} for “the set of x such that the
proposition P(x) holds”, the interval is the set

[a, b] = {x | a ≤ x ≤ b}, where x ∈ ℝ

Using this convention, real numbers are represented as intervals with identical
upper and lower bounds. Such intervals are called “Degenerate Intervals” and have
the form [a, a]. The usual operations of addition, subtraction, multiplication, and
division that are possible with real numbers are also defined for interval numbers
[32].

An interval number is also a set of real numbers. The interval number [a, b] is the
set of real numbers x such that a ≤ x ≤ b. Hence, the set operations of union and
intersection can also be performed on interval numbers. Section 2.1 discusses in
depth the various arithmetic and set operations performed by the proposed ALU.

In interval arithmetic, the true result is guaranteed to lie within the resulting
interval. This is achieved by the Outward Rounding algorithm. Outward rounding of
an interval X = [a, b] is achieved by rounding the lower bound, a, to the largest
machine-representable number not greater than a, and the upper bound, b, to the
smallest machine-representable number not less than b. This involves the use of the
round down and round up modes on the lower and upper bounds, respectively. Directed
Rounding capabilities, that is, the ability to round down or round up, have been
available since the Intel 8087 chip. As a result, interval arithmetic is possible
on virtually any computer.
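Outward rounding can be sketched for fixed-point values stored as integers. The example below is a minimal illustration and not the rounding unit of this design: it assumes that each bound is held as a signed integer from which `drop_bits` low-order fractional bits are to be discarded, and the function name `round_outward` is hypothetical. An arithmetic right shift of a two's complement value rounds toward negative infinity, which corresponds to the round down mode; round up is obtained here by negating, shifting and negating again.

```python
from typing import Tuple


def round_outward(lower_raw: int, upper_raw: int, drop_bits: int) -> Tuple[int, int]:
    """Discard drop_bits low-order bits from both bounds, rounding outward.

    The lower bound is rounded down (toward -infinity) and the upper
    bound is rounded up (toward +infinity), so the returned interval
    always encloses the unrounded one.
    """
    # Python's >> on int is an arithmetic shift: floor division by 2**drop_bits.
    lower = lower_raw >> drop_bits
    # ceil(v / 2**n) equals -floor(-v / 2**n).
    upper = -((-upper_raw) >> drop_bits)
    return lower, upper


# [-21/4, 21/4] = [-5.25, 5.25] reduced to integer precision.
print(round_outward(-21, 21, 2))   # (-6, 6)
# Bounds that are already representable are left unchanged.
print(round_outward(20, 20, 2))    # (5, 5)
```

Values that are already representable at the reduced precision are left unchanged, so the rounded interval is never wider than necessary.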

2.1 Interval Arithmetic and Set Operations

As described in the previous section, interval numbers are represented by an

ordered pair [a, b] such that a ≤ b. The arithmetic interval operations of addition,

subtraction and multiplication will be discussed in this section. The rules for

the set operations of union and intersection along with the calculation of width and

mid-point of a single interval are also described.

In the following discussion, we consider two input interval numbers. They are

represented as [xL, xU] and [yL, yU], where xL and yL are the lower bounds and xU

and yU are the upper bounds of the two intervals. Except for one special case of set

union of two disjoint sets, all operations result in a single output interval number.

The outputs of various interval operations are obtained as follows:

• Addition

Addition of interval numbers is a straightforward operation where the lower

bound of the output interval is obtained from the sum of the lower bounds of

the input intervals, while the upper bound of the output interval is obtained

from the sum of the upper bounds of the input intervals.

Mathematically, this can be represented as:

[xL, xU] + [yL, yU] = [xL+yL, xU +yU]

• Subtraction

In subtraction, the lower bound of the output interval is obtained by subtracting
the upper bound of the second interval number from the lower bound of the first
interval number. Similarly, the upper bound of the output interval is obtained by
subtracting the lower bound of the second interval number from the upper bound of
the first interval number.

Mathematically, this can be represented as:

[xL, xU] − [yL, yU] = [xL − yU, xU − yL]

Table 2.1: Nine Cases in Multiplication

Case  Condition                  Result
1     xL ≥ 0; yL ≥ 0             [xLyL, xUyU]
2     xL ≥ 0; yL < 0 < yU        [xUyL, xUyU]
3     xL ≥ 0; yU ≤ 0             [xUyL, xLyU]
4     xL < 0 < xU; yL ≥ 0        [xLyU, xUyU]
5     xL < 0 < xU; yU ≤ 0        [xUyL, xLyL]
6     xU ≤ 0; yL ≥ 0             [xLyU, xUyL]
7     xU ≤ 0; yL < 0 < yU        [xLyU, xLyL]
8     xU ≤ 0; yU ≤ 0             [xUyU, xLyL]
9     xL < 0 < xU; yL < 0 < yU   [min(xUyL, xLyU), max(xLyL, xUyU)]

• Multiplication

Multiplication presents a more difficult problem than addition and subtraction.
Unlike in these two operations, the sign of the operands needs to be taken into
consideration in addition to their magnitude. Both the sign and the magnitude of
the operands decide which two values are to be multiplied to obtain the lower and
upper bounds, respectively. In the general case, the result of the multiplication
of two input intervals is obtained as follows:

If [xL, xU] ∗ [yL, yU] = [zL, zU], then,

zL = min(xLyL, xLyU, xUyL, xUyU) and

zU = max(xLyL, xLyU, xUyL, xUyU)

These computations require eight multiplications (four products for each of the two
independently computed bounds) and several comparisons to be performed before the
lower and upper bounds of the output interval can be obtained. This makes the
multiplication operation highly inefficient. To overcome this problem, the
multiplication operation is split into nine different cases based on the values of
the operands with respect to zero. Table 2.1 lists these

From this table, it can be observed that the task of obtaining the lower bound and
the upper bound of the output interval is reduced to two multiplications, compared
to the eight multiplications required by the brute-force method. Comparisons of the
input values need to be done first to determine which case they belong to. Reducing
the number of multiplications required to determine the output helps make the
design more hardware efficient.

A special mention needs to be made of case 9, where both the input intervals

include zero in them. Although this would be a rare case in high resolution

processors, it needs to be addressed for the purpose of numerical reliability. As

can be seen from the table, the output for this case requires 4 multiplications

and 2 comparisons to be performed. This case leads to increased complexity

in the design and from the hardware point of view requires double the amount

of computational time as compared to other operations.

• Union of Interval Numbers

Union of interval numbers is done in the same way as the union operation

in set theory. By definition, for two sets A and B, (A ∪ B) is defined as a

set containing all elements of set A and all elements of set B. Similarly for

interval numbers, to perform the union operation, the lower bound is obtained

by determining the minimum value of the lower bounds of the two input

intervals while the upper bound is obtained by determining the maximum

value of the upper bounds of the input intervals.

[xL, xU]∪[yL, yU] = [min(xL, yL), max(xU, yU)]

For interval numbers, the union of two disjoint sets has to be dealt with

output intervals being exactly equal to each of the input intervals. Amongst all
operations performed by the ALU, this is the only operation which results in two
output intervals.

For two disjoint intervals, [xL, xU] and [yL, yU],

[xL, xU]∪[yL, yU] = [xL, xU] + [yL, yU]

• Intersection of Interval Numbers

Intersection of interval numbers is done in the same way as the intersection

operation in set theory. By definition, for two sets A and B, (A ∩ B) is

defined as a set containing only those elements that belong to set A and to set

B. For the intersection operation, the lower bound is obtained by determining

the maximum value of the lower bounds of the two input intervals while the

upper bound is obtained by determining the minimum value of the upper

bounds of the input intervals.

[xL, xU]∩[yL, yU] = [max(xL, yL), min(xU, yU)]

A null set is obtained for the intersection of two disjoint sets.

• Width

The “width” operation is performed on a single interval. Width of an interval

is defined as the difference between the upper bound and lower bound of the

interval. The output is naturally a single value.

width[xL, xU] = xU − xL

• Mid-point

The “mid-point” operation is also performed on a single interval. Mid-point

of an interval is obtained by taking the average of the lower bound and upper

bound of the input interval. Once again, the output is a single value.

midpoint[xL, xU] = (xU+xL)/2

Division by 2 is performed by right shifting the sum of the two bounds of the

interval by one bit. Sign extension by one bit also needs to be done.
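The set and scalar operations described above can be sketched in a few lines of Python. This is an illustrative software model only, not the hardware description used in this work: intervals are plain (lower, upper) tuples of integers standing in for fixed-point words, and all function names are assumptions made for the example. A disjoint union returns both input intervals, and a disjoint intersection returns None to stand for the null set.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (lower, upper), raw fixed-point integers


def is_disjoint(x: Interval, y: Interval) -> bool:
    """True when the two intervals share no point."""
    return x[1] < y[0] or y[1] < x[0]


def union(x: Interval, y: Interval) -> List[Interval]:
    """Union; two disjoint inputs are returned unchanged as two intervals."""
    if is_disjoint(x, y):
        return [x, y]
    return [(min(x[0], y[0]), max(x[1], y[1]))]


def intersection(x: Interval, y: Interval) -> Optional[Interval]:
    """Intersection; None models the null set for disjoint inputs."""
    if is_disjoint(x, y):
        return None
    return (max(x[0], y[0]), min(x[1], y[1]))


def width(x: Interval) -> int:
    return x[1] - x[0]


def midpoint(x: Interval) -> int:
    # Division by 2 as an arithmetic right shift by one bit.
    # Python integers are unbounded, so the extra sign bit is implicit here.
    return (x[0] + x[1]) >> 1
```

Note that the right shift discards the least significant bit, so a mid-point that is not exactly representable is rounded towards negative infinity in this sketch.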

The operations described above will be implemented on the proposed I-ALU.

Unique to this work will be the fact that these operations in conjunction with

Design Specifications of the ALU

The ALU is based on a parallel architecture where computation of the lower

bound and the upper bound of the output interval is simultaneously done. The

design is built for fixed point operation using the two’s complement representation

of numbers. Fixed-point two’s complement interval arithmetic and rounding are

described in detail in this section.

3.1 Fixed-point two’s complement arithmetic

The main focus of this design is to build a fixed-point interval arithmetic and

logic unit as against certain floating-point interval units that have been designed

previously and discussed in brief in section 1.2. To this end, it is important to get

familiar with the operations performed on fixed-point numbers. This section is an in

depth study on the working of fixed-point arithmetic. It explains the functionality of

the three basic operations viz. addition, subtraction and multiplication, performed

on fixed-point numbers. Given that our design is oriented towards DSP related

applications, division is not performed in hardware. Division by powers of 2 is

based on the application for which the ALU is going to be used once we have a proper

understanding of the working of fixed-point arithmetic. Irrespective of being a real

number ALU or an ALU for Interval Arithmetic, the logic behind the mathematics

that is being performed remains the same.

As the most generalized case, two’s complement format for representing the

fixed-point numbers has been used. It accounts for operations performed on both positive

and negative numbers.

3.1.1 Representation of numbers

In the binary number system, an N-bit word represents integer values from 0

to 2^N − 1. This is referred to as the unsigned integer representation. The

fixed-point representation has a predefined position of the radix point, which means that

we have a fixed number of bits reserved for the integer part and a fixed number for

the fractional part. A 32-bit number having 16 bits reserved for the integer part and 16

for the fractional part is represented as 16:16. However, this mode lacks the ability

to represent negative numbers.

The two’s complement method of representing fixed-point numbers accounts for both

positive and negative numbers. The MSB of this fixed-point number indicates the

sign (referred to as the sign-bit), whereas the rest of the bits define the magnitude of

the number. Figure 3.1 shows the structure of an N-bit signed number in two’s

complement format as used in this implementation. The range of numbers represented

Figure 3.1: Two’s complement number representation (bit 0: sign; bits 1 to N−1: fraction)

Table 3.1: Two’s complement fixed-point representation

Binary Two’s Complement    Decimal Equivalent
00010110.11000000          22.75
11101001.01000000          -22.75
00001000.00100000          8.125
11110111.11100000          -8.125

of numbers greatly simplifies the hardware implementation of the arithmetic being

performed. Table 3.1 provides a few examples of 16-bit two’s complement fixed-point

numbers in the 8:8 format.
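As a quick check of the 8:8 encodings listed in Table 3.1, the conversion can be sketched in Python. This is a minimal illustration, assuming the bit pattern is held as a non-negative integer; the helper names to_fixed, from_fixed and bit_string are hypothetical and not part of the design.

```python
def to_fixed(value: float, int_bits: int = 8, frac_bits: int = 8) -> int:
    """Return the two's complement bit pattern of value as a non-negative int."""
    total = int_bits + frac_bits
    scaled = round(value * (1 << frac_bits))
    if not -(1 << (total - 1)) <= scaled < (1 << (total - 1)):
        raise OverflowError("value does not fit in the chosen format")
    # Masking a negative Python int yields its two's complement pattern.
    return scaled & ((1 << total) - 1)


def from_fixed(pattern: int, int_bits: int = 8, frac_bits: int = 8) -> float:
    """Decode a two's complement bit pattern back to a real value."""
    total = int_bits + frac_bits
    if pattern & (1 << (total - 1)):  # sign bit set: value is negative
        pattern -= 1 << total
    return pattern / (1 << frac_bits)


def bit_string(pattern: int, int_bits: int = 8, frac_bits: int = 8) -> str:
    """Format a pattern as integer_bits.fraction_bits."""
    bits = format(pattern, f"0{int_bits + frac_bits}b")
    return bits[:int_bits] + "." + bits[int_bits:]
```

For instance, bit_string(to_fixed(-22.75)) gives 11101001.01000000, which matches the second row of the table.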

3.1.2 Arithmetic Operations

This section provides examples of various operations performed on two’s

complement fixed-point numbers. Three basic operations of addition, subtraction and

multiplication are considered. Let us go through each of these operations one at a

time.

• Addition

Addition involves simple addition of bits when the number is represented in

two’s complement form. The two operands are sign extended from 16 bits to

17 bits and the 17th bit of the result of addition is then sign extended to obtain

the 32 bit output. The following examples illustrate the addition operation:

1. 22.75 + (-8.125) = 14.625

0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0

+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0

0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0

The 17th bit, 0, is used for sign extension and 7 zeros are added to

the left. 8 zeros are added to the right of the number to perform sign

extension of the decimal part. Hence 00001110.10100000 in the 8:8 format is represented

as 0000000000001110.1010000000000000 in the 16:16 format.

2. (-8.125) + (-8.125) = (-16.25)

1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0

+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0

1 1 1 1 0 1 1 1 1 . 1 1 0 0 0 0 0 0

The 17th bit, 1, is used for sign extension and 7 ones are added to

the left. 8 zeros are added to the right of the number to perform sign

extension of the decimal part. Hence 11101111.11000000 in the 8:8 format

is represented as

1111111111101111.1100000000000000 in the 16:16 format.

Thus, in terms of hardware, the 16 bit number needs to be sign extended to

17 bits and the 17th bit of the result needs to be used for sign extension.

• Subtraction

The same rules as in addition are followed while performing

subtraction of numbers in the two’s complement form. The only change

is that we take the two’s complement of the number

to be subtracted and then add it to the other number, which is in its two’s

complement form. The rest of the procedure remains unchanged. The following

examples illustrate the subtraction operation:

1. 22.75 - 8.125 = 14.625

22.75 in two’s complement form is represented as 00010110.11000000.

The subtrahend 8.125 is represented as 00001000.00100000, and its two’s complement is 11110111.11100000. Hence,

0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0

+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0

0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0

The 17th bit, 0, is used for sign extension and 7 zeros are added to

the left. 8 zeros are added to the right of the number to perform sign

extension of the decimal part. This gives us the desired result of the

subtraction 14.625.

2. 8.125 - 22.75 = (-14.625)

The two’s complement of the subtrahend 22.75 is 11101001.01000000. Hence,

0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0

+ 1 1 1 1 0 1 0 0 1 . 0 1 0 0 0 0 0 0

1 1 1 1 1 0 0 0 1 . 0 1 1 0 0 0 0 0

The 17th bit, 1, is used for sign extension and 7 ones are added to

the left. 8 zeros are added to the right of the number to perform sign

extension of the decimal part. This gives us the desired result of the

subtraction -14.625.

3. 22.75 - (-8.125) = 30.875

The two’s complement of the subtrahend -8.125 is 00001000.00100000. Hence,

0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0

+ 0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0

0 0 0 0 1 1 1 1 0 . 1 1 1 0 0 0 0 0

The 17th bit, 0, is used for sign extension and 7 zeros are added to

the left. 8 zeros are added to the right of the number to perform sign

extension of the decimal part. Thus, we obtain the desired result 30.875.

• Multiplication

the issue of sign extension is not involved in multiplication. Multiplication of

two 16:16 numbers will result in a 32:32 number. In my examples, I consider

numbers of the 4:4 format. I have illustrated a couple of examples to explain

the multiplication operation:

1. 1.25 ∗ 3.25 = 4.0625

0001.0100 ∗ 0011.0100 = 00000100.00010000

2. 7.9375 ∗ 7.9375 = 63.00390625

0111.1111 ∗ 0111.1111 = 00111111.00000001

If either operand is a negative number, we have to first take the

two’s complement of that number and then perform the usual multiplication

as explained above. The sign of the result will depend on the number of two’s

complements that we have taken before performing the multiplication. In

hardware, this is achieved by doing the exclusive-OR of the two sign bits.
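The worked examples above can be reproduced with a small Python model. This is a sketch under stated assumptions rather than the actual datapath: operands are held as non-negative integers carrying the two’s complement bit pattern, and the helper names sign_extend, add_8_8, sub_8_8 and mul_fixed are hypothetical. Addition and subtraction follow the 17-bit sign-extension steps described above; multiplication follows the magnitude-and-sign method, with the result sign taken from the exclusive-OR of the two sign bits.

```python
def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Replicate the sign bit of a from_bits-wide pattern up to to_bits."""
    if (pattern >> (from_bits - 1)) & 1:
        pattern |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return pattern


def add_8_8(a: int, b: int) -> int:
    """Add two 16-bit 8:8 patterns and return a 32-bit 16:16 pattern."""
    total17 = (sign_extend(a, 16, 17) + sign_extend(b, 16, 17)) & 0x1FFFF
    # Extend the 17-bit sum by 7 sign bits on the left, then append
    # 8 zeros on the right of the fractional part.
    return sign_extend(total17, 17, 24) << 8


def sub_8_8(a: int, b: int) -> int:
    """Compute a - b by adding the two's complement of b."""
    neg_b17 = (-sign_extend(b, 16, 17)) & 0x1FFFF
    total17 = (sign_extend(a, 16, 17) + neg_b17) & 0x1FFFF
    return sign_extend(total17, 17, 24) << 8


def mul_fixed(a: int, b: int, bits: int) -> int:
    """Multiply two bits-wide patterns; the result is 2*bits wide."""
    mask = (1 << bits) - 1
    sign_a = (a >> (bits - 1)) & 1
    sign_b = (b >> (bits - 1)) & 1
    mag_a = (-a) & mask if sign_a else a  # two's complement of a negative operand
    mag_b = (-b) & mask if sign_b else b
    product = mag_a * mag_b
    if sign_a ^ sign_b:  # XOR of the sign bits gives the result sign
        product = (-product) & ((1 << (2 * bits)) - 1)
    return product
```

With 22.75 encoded as 0x16C0 and -8.125 as 0xF7E0, add_8_8 returns 0x000EA000, which is the 16:16 pattern for 14.625 shown in the first addition example.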

Addition, subtraction and multiplication are three very important operations

performed by the I-ALU. The functionality of all three operations has been

described in detail in this section. This study goes a long way in determining the

hardware architecture of the system. Besides these operations, multiply-accumulate

forms the heart of any DSP processor. Special emphasis has been laid on this in

the following sections. Division by numbers other than powers of 2 occurs

less frequently in DSP related applications. Division is implemented only by the

shift operation because of the cost of general division in time and area.

Section 3.2 addresses the important issue of rounding.

3.2 Outward/Directed Rounding

For most systems, although the internal buses of an ALU may be wide enough,

they have fixed sized registers. Input to this system is 16-bit with an internal bus

for them to be stored in these smaller sized registers. This reduction in word length

is achieved by the rounding operation. The bits of lower significance of the output

are suitably discarded depending on the rounding direction of the operand. This

introduces errors called precision rounding errors. However, Interval Arithmetic

makes sure that the exact result of the operation lies within the output interval.

Provision is made in this system to round the output interval values from 32 bits to

either 24 bits or 16 bits depending on an input provided. This provision is made to keep the design flexible for different applications.

As discussed earlier, the proposed system is based on two’s complement

fixed-point number representation. Two’s complement number representation greatly

simplifies the rounding algorithm because the same procedure is

followed for rounding up both positive and negative numbers. Likewise, a single (different)

procedure is used for rounding down both positive and negative numbers. The IEEE 754

standard defines four rounding modes viz. round towards nearest, round towards

zero, round towards positive infinity and round towards negative infinity. When

applying these to Interval Arithmetic, we are concerned with two cases: rounding towards

positive infinity and rounding towards negative infinity. The algorithms for

rounding towards positive infinity and negative infinity are explained below with suitable

examples.

• Rounding towards negative infinity.

Rounding towards negative infinity refers to denoting a high precision number

by the greatest machine representable number of low precision but smaller in

value. In fixed-point two’s complement representation, this is achieved by

simply discarding the bits of lower significance. This algorithm holds true for

positive and negative numbers as illustrated by the following examples:

– Positive numbers

6.78125 in the 8:8 format is represented as 00000110.11001000.

In the 8:4 format, it is represented as 00000110.1100, which is 6.75.

– Negative numbers

-6.78125 in the 8:8 format is represented as 11111001.00111000.

In the 8:4 format, it is represented as 11111001.0011, which is -6.8125.

• Rounding towards positive infinity.

Rounding towards positive infinity refers to denoting a high precision number

by the smallest machine representable number of low precision but greater

in value. In fixed-point two’s complement representation, this is achieved by

performing a logical ‘OR’ on the bits of lower significance to be discarded and

then adding this bit to the number to be retained. Once again, this algorithm

holds true for positive and negative numbers as illustrated by the following

examples:

– Positive numbers

6.78125 in the 8:8 format is represented as 00000110.11001000.

In the 8:4 format, it is represented as 00000110.1101, which is 6.8125.

– Negative numbers

-6.78125 in the 8:8 format is represented as 11111001.00111000.

In the 8:4 format, it is represented as 11111001.0100, which is -6.75.
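The two rounding rules can be condensed into a short Python sketch. It assumes the value is held as a non-negative integer carrying the two’s complement bit pattern; the function names are hypothetical, and overflow of the retained word when rounding up is not modeled (the result is simply masked to the retained width).

```python
def round_down(pattern: int, drop_bits: int) -> int:
    """Round towards negative infinity by discarding the low-order bits."""
    return pattern >> drop_bits


def round_up(pattern: int, drop_bits: int, kept_bits: int) -> int:
    """Round towards positive infinity: OR the discarded bits, add the result."""
    discarded = pattern & ((1 << drop_bits) - 1)
    or_bit = 1 if discarded else 0  # logical OR of all discarded bits
    return ((pattern >> drop_bits) + or_bit) & ((1 << kept_bits) - 1)


def outward_round(lower: int, upper: int, drop_bits: int, kept_bits: int):
    """Round an interval outward: lower bound down, upper bound up."""
    return round_down(lower, drop_bits), round_up(upper, drop_bits, kept_bits)
```

For the 8:8 to 8:4 examples above, drop_bits is 4 and kept_bits is 12. The same functions cover the 32-bit to 24-bit case with drop_bits of 8 and kept_bits of 24, and the 32-bit to 16-bit case with drop_bits of 16 and kept_bits of 16.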

The above examples can be used for rounding of 32 bit fixed-point numbers in the

16:16 format to 24 bit fixed-point numbers in the 16:8 format. A similar procedure

is followed if the output has to be reduced to 16 bits from 32 bits. These examples

cover all aspects of the rounding algorithm. It is called the “Outward Rounding”

or “Directed Rounding” algorithm and is responsible for the validation of results

provided by interval analysis. The study of this algorithm makes it very simple to

design the hardware to implement outward rounding. The proposed design

includes a separate rounding unit which takes inputs from the functional units and

provides the outputs of the system.

After getting acquainted with the design specifications of the ALU, I now proceed

Hardware Architecture

This chapter of the thesis contains a description of all the modules that constitute

the Interval-ALU. It gives details of the logic design at the gate level for the whole

system, one module at a time. The hardware model at the RTL level of abstraction

is built from these logic designs. Since throughput is the main performance metric

to be optimized, the logic is designed with reduction of the critical path delay in

mind. Several pipelined versions of the design are built along with the basic

non-pipelined one to improve the throughput. I begin with the top level block diagram

of the design and then go into the details of each module.

4.1 Overall Architecture

The overall architecture of the ALU can be seen in the block diagram shown in

Figure 4.1. The hardware model is divided into four parts, viz. the flag generator,

lower bound and upper bound modules, and the rounding unit. The flag generator

module is responsible for generating the control signals for the more complicated

multiplication operation. As the name suggests, the lower bound module and the

respectively. These two modules are independent of each other and hence operate in

parallel. The rounding unit implements the Outward Rounding algorithm explained

earlier.

Figure 4.1: Top Level Block Diagram of the ALU

The ALU is designed for operation on 16 bit input interval numbers in the two’s

complement form. The ALU has an input line that allows selection of the

multiply-accumulate mode, acc select. The ALU operates in the accumulate mode as long

as this line is held high. Another input line, rctl, determines the number of output

bits. The output is rounded to 24 bits when this line is held high and 16 bits when

this line is held low. Table 4.1 lists all the inputs to the ALU.

The ALU has two 24-bit output lines that represent the lower and upper bounds

of the resulting interval. Besides this, there are output lines to highlight the special

Table 4.1: Description of ALU Inputs

Input         Description                                Bit Width
xL            Lower bound on left-hand operand           16 bits
xU            Upper bound on left-hand operand           16 bits
yL            Lower bound on right-hand operand          16 bits
yU            Upper bound on right-hand operand          16 bits
command       Mathematical operation to be performed     3 bits
acc select    Perform MAC when asserted                  1 bit
rctl          Width of output results (16 or 24 bits)    1 bit

two disjoint sets. A further explanation of these output lines is provided in the

following sections. Table 4.2 lists all the outputs of the ALU.

Table 4.2: Description of ALU outputs

Output    Description                      Bit Width
zL        Lower bound on result            24 bits
zU        Upper bound on result            24 bits
next      Valid results on output lines    1 bit
union     Union of disjoint sets           1 bit
empty     Intersection of disjoint sets    1 bit

4.2 Flag Generator Module

A significant feature of the design is the reduction in the number of

multiplications performed to evaluate the result of an interval multiplication operation. The

flag generator module forms the control logic for the multiplication operation. It

multiplication. Based on the values of the input operands, it generates a 4 bit mul flag

which selects among the nine cases. Table 4.3 shows the case to be selected based

on the value on the mul flag.

Table 4.3: mul flag for the Multiplication operation

mul     Case                        Result
0001    xL ≥ 0; yL ≥ 0              [xLyL, xUyU]
0010    xL ≥ 0; yL < 0 < yU         [xUyL, xUyU]
0011    xL ≥ 0; yU ≤ 0              [xUyL, xLyU]
0100    xL < 0 < xU; yL ≥ 0         [xLyU, xUyU]
0101    xL < 0 < xU; yU ≤ 0         [xUyL, xLyL]
0110    xU ≤ 0; yL ≥ 0              [xLyU, xUyL]
0111    xU ≤ 0; yL < 0 < yU         [xLyU, xLyL]
1000    xU ≤ 0; yU ≤ 0              [xUyU, xLyL]
0000    xL < 0 < xU; yL < 0 < yU    [min(xUyL, xLyU), max(xLyL, xUyU)]

The logic behind the generation of this flag is shown in Figure 4.2. Table 4.4

explains the generation of the various flag signals used in Figure 4.2. As shown in

the table, the flag signals are generated based on the values of the input operands.

These flags are used to obtain the mul signal based on which the inputs to the

multipliers in the main functional units are selected. This reduces the number of

multiplications to be performed.
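The case selection performed by the flag generator can be modeled in software as follows. This is a hedged sketch and not the gate-level logic of Figure 4.2: the six comparisons mirror the flag conditions described above, the 4-bit codes follow the mul flag encoding, and the function names, the SELECT table and the priority order used when an operand is exactly zero are assumptions of this example. When both operands are non-positive, the sketch uses the standard result [xU·yU, xL·yL].

```python
XL, XU, YL, YU = range(4)  # indices into the operand tuple (xL, xU, yL, yU)

# mul flag -> (operands of the lower-bound product, operands of the upper-bound product)
SELECT = {
    0b0001: ((XL, YL), (XU, YU)),
    0b0010: ((XU, YL), (XU, YU)),
    0b0011: ((XU, YL), (XL, YU)),
    0b0100: ((XL, YU), (XU, YU)),
    0b0101: ((XU, YL), (XL, YL)),
    0b0110: ((XL, YU), (XU, YL)),
    0b0111: ((XL, YU), (XL, YL)),
    0b1000: ((XU, YU), (XL, YL)),  # both operands non-positive
}


def mul_flag(xL: int, xU: int, yL: int, yU: int) -> int:
    """Generate the 4-bit case code from sign comparisons of the bounds."""
    f1, f2, f3, f4 = xL >= 0, yL >= 0, xU <= 0, yU <= 0
    f5 = xL < 0 < xU
    f6 = yL < 0 < yU
    for code, (fx, fy) in enumerate(
        [(f1, f2), (f1, f6), (f1, f4), (f5, f2), (f5, f4),
         (f3, f2), (f3, f6), (f3, f4)], start=1):
        if fx and fy:
            return code
    return 0b0000  # both intervals contain zero


def interval_mul(x, y):
    """Interval product using two multiplications (four in the special case)."""
    v = (x[0], x[1], y[0], y[1])
    flag = mul_flag(*v)
    if flag == 0b0000:
        return (min(v[XU] * v[YL], v[XL] * v[YU]),
                max(v[XL] * v[YL], v[XU] * v[YU]))
    (a, b), (c, d) = SELECT[flag]
    return (v[a] * v[b], v[c] * v[d])
```

For example, interval_mul((-3, -1), (-4, -2)) selects code 1000 and returns (2, 12), the same result a brute-force minimum and maximum over all four products would give.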

4.3 Lower Bound and Upper Bound Modules

The lower bound and the upper bound modules have very similar hardware

structures. The primary difference is the inputs that drive their functional units.

Table 4.4: Flag Generation

Condition                 Flag Generated
xL ≥ 0                    flag 1
yL ≥ 0                    flag 2
xU ≤ 0                    flag 3
yU ≤ 0                    flag 4
(xL < 0) && (xU > 0)      flag 5
(yL < 0) && (yU > 0)      flag 6

Figure 4.2: Flag Generation Module

is concentrated in them. Both modules are independent of each other’s operation

which makes the working of the ALU dependent only on the input signals and not

perform the accumulate operation. This is an important feature of the ALU because

dot product which is implemented through the multiply-accumulate operation forms

the core requirement of any DSP processor. Figure 4.3 and Figure 4.4 display the

basic block diagram of each of the two modules. As seen in the block diagrams, the

two modules are very similar in architecture. However, it is important to note the

different status lines generated by different portions of the modules.

Figure 4.3: Block Diagram of Lower Bound Module

Figure 4.4: Block Diagram of Upper Bound Module

In a non-pipelined architecture, the circuit performance is determined by these

modules of the design. The critical path of a design may be defined as the single slowest

feasible path contained in the design. The greater the logic depth, the longer the critical

throughput. Since it forms the critical path, a significant effort needs to be put in

to optimize it. Pipelining provides the ideal solution to improve the throughput

problem and is discussed in detail beginning section 4.5. I will now go through each

of the individual blocks that make up the ALU as shown in Figure .

4.3.1 Functional Units and Control Logic

The combinational logic required to perform arithmetic operations on interval

numbers is located in the functional unit block. Apart from the difference in a few

output status lines, the hardware for functional units in each of the two modules is

identical. Each module has an adder/subtractor, a multiplier and other

combinatorial logic to implement the set operations. Figure 4.5 shows the functional unit in

the lower bound module. It is important to note that the inputs to the adder are

the two lower bounds of the input operands, while the inputs to the subtractor are

the lower bound of the first input operand and the upper bound of the second. The

type of operation and the mul signal are used as controls to determine the outputs

of the functional units. The status line empty is generated by this portion of the

design to indicate the intersection of two disjoint sets.

Figure 4.6 shows the functional unit in the upper bound module. The inputs

to the adder and subtractor are different for this module than the lower bound

module. The upper bounds of both the input operands drive the adder, while the

upper bound of the first input operand and the lower bound of the second input

operand drive the subtractor. Once again, the mul flag determines the inputs to the

multiplier and the outputs of the union and intersection set operations. The status

line union is generated by this module which indicates the union of two disjoint

sets.

The outputs of the arithmetic units are given to the special case multiplication

Figure 4.5: Lower Bound Module

4.3.2 Special Case Multiplication Block

The situation in which both input operands include zero in their intervals

represents a special case, and is referred to as the ‘Special Case Multiplication’. In

contrast to the normal cases where interval multiplication requires two

Figure 4.6: Upper Bound Module

of the bounds in this special case. Hence we require a memory element which would

store a value and make it available for comparison with the next available value. It

requires two multiplications and one comparison to be performed to obtain each of

the two bounds. The hardware to determine the lower bound and the upper bound

is identical and is repeated in both the modules. A status line next is taken from this

hardware architecture of this block. As seen in the diagram, the left/right out c

line coming in from the functional unit block is stored as left/right out r and used for comparison. The result of this comparison is used only for the interval

multiplication operation when the mul flag is 0000. The minimum value is selected for

the lower bound module and the maximum value is selected for the upper bound

module. The special case multiplication may lead to synchronization problems if

not dealt with properly.

Figure 4.7: Special Case Multiplication
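The two-cycle behavior of this block can be illustrated with a small behavioral model in Python. It is only a sketch: the class name, the clock method and the order in which the two products arrive are assumptions, and the register is modeled as a single stored value. The lower bound module would use min as the comparison and the upper bound module would use max.

```python
class SpecialCaseUnit:
    """Behavioral model: store one product, compare it with the next one."""

    def __init__(self, pick):
        self.pick = pick      # min for the lower bound, max for the upper bound
        self.stored = None    # memory element holding the first product

    def clock(self, product):
        """Advance one clock cycle; return (result, next_flag)."""
        if self.stored is None:
            # First cycle: keep the product and signal that the
            # actual result follows in the next cycle.
            self.stored = product
            return None, True
        # Second cycle: compare the stored value with the new product.
        result = self.pick(self.stored, product)
        self.stored = None
        return result, False
```

For x = [-2, 3] and y = [-4, 5], the lower bound unit would see the products xU·yL = -12 and xL·yU = -10 on consecutive cycles and output -12, while the upper bound unit would see xL·yL = 8 and xU·yU = 15 and output 15.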

4.3.3 Multiply-Accumulate Block

A multiply-accumulator forms a very important part of the ALU, more so,

because it is intended for use in DSP applications. DSP applications are characterized

Mathematically, the dot product can be calculated as:

$a \cdot b = \sum_{i=0}^{n} a_i b_i$

This operation can be readily performed by a multiply accumulate block. Figure

4.8 shows the hardware architecture of the multiply-accumulate block. As we can

see in the figure, it consists of an adder and a memory element which acts as the

accumulator. An input line acc select determines whether the output needs to be accumulated or not. When high, the block is in accumulate mode. The output of

this block is 32 bits long and is given to the rounding unit, where outward rounding

is implemented.

Figure 4.8: Multiply-Accumulate Module
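An interval dot product built on a multiply-accumulate step can be sketched as follows. This is a simplified software model, not the circuit of Figure 4.8: the class and function names are hypothetical, the interval product is computed by brute force over the four bound products for brevity, and it is assumed that the first term is loaded with acc select low and the remaining terms are accumulated with acc select high.

```python
def interval_mul(x, y):
    """Brute-force interval product (minimum and maximum of the four products)."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return min(p), max(p)


class IntervalMac:
    """Accumulator pair: one register for each bound of the running sum."""

    def __init__(self):
        self.acc_lo = 0
        self.acc_hi = 0

    def step(self, x, y, acc_select: bool):
        lo, hi = interval_mul(x, y)
        if acc_select:
            # Accumulate mode: add the new product to the stored sum.
            self.acc_lo += lo
            self.acc_hi += hi
        else:
            # Non-accumulate mode: the product replaces the stored sum.
            self.acc_lo, self.acc_hi = lo, hi
        return self.acc_lo, self.acc_hi


def interval_dot(a, b):
    """Dot product of two equal-length lists of intervals."""
    mac = IntervalMac()
    result = (0, 0)
    for i, (x, y) in enumerate(zip(a, b)):
        result = mac.step(x, y, acc_select=(i > 0))
    return result
```

For a = [[1, 2], [-1, 1]] and b = [[3, 4], [2, 2]], the two products are [3, 8] and [-2, 2], so the accumulated result is [1, 10]. Outward rounding of the two accumulated bounds would follow as a separate step.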

4.4 Rounding Unit

The rounding unit forms a critical part of the Interval-ALU. This unit performs

the outward rounding, which guarantees the result of a computation to lie within

the output interval. The proposed design has provision to round a 32 bit output

for which it is going to be used. An input line, rctl, determines the number of bits to

which the output has to be rounded. When this line is high, the output is rounded

to 24 bits, else it is rounded to 16 bits. The outward rounding algorithm has been

discussed in section 3.2, hence this section concentrates on describing the hardware

to implement these rounding modes. Figure 4.9 shows the architecture for rounding

the output of the lower bound module.

Figure 4.9: Lower Bound Rounding

From the figure we can see that 8 or 16 bits of lower significance are discarded

based on the rctl input, and the higher 24 bits or 16 bits are retained.

The rounding operation for the upper bound module is slightly more complicated

as compared to the lower bound module. In this case, the bits of lower significance

are not simply discarded, but are logically ‘OR’ed and the resultant bit is added to

the bits to be retained. If the rctl line is high, the last 8 bits are logically ‘OR’ed

and added to the 24 bits of high significance; otherwise, the

last 16 bits are logically ‘OR’ed and added to the 16 bits of high significance. Figure

4.10 shows the architecture for rounding the output of the upper bound module.

Figure 4.10: Upper Bound Rounding

This completes the architecture of the entire ALU. From the architecture, it is

safe to say that the lower bound module and the upper bound module are the

critical modules in the design. Maximum logic is concentrated in them and hence it

is important to optimize them to a high degree. The following section concentrates

on optimizing these critical modules to improve the throughput of the system, by

pipelining the design to the highest degree. It provides details of several pipelined

4.5 Pipeline Architecture of the Design

This section presents several versions of the I-ALU, pipelined to various degrees

so as to achieve maximum throughput. Pipelining is a technique used to reduce the

critical path of the circuit and hence improve the speed at which the circuit can

operate. Increase in throughput for any circuit by implementing pipelining comes at

the cost of increased area, increased power dissipation and increased initial latency.

Although the technique of pipelining may appear to have more disadvantages than

advantages, in DSP systems where throughput is of prime importance, it goes a

long way in improving the efficiency of the system. As seen earlier, DSP systems

are characterized by several multiplication and addition instructions. A multiplier

involving combinational logic alone has a very high logic depth and is one of the

main candidates that forms the critical path. Hence, to reduce this logic depth, it is

necessary to pipeline the multiplier. However, the depth to which a multiplier can

be pipelined saturates at a certain point and then it becomes necessary to further

pipeline the design to improve its efficiency. The following sections discuss

these pipelining techniques in further detail.

4.5.1 Need for Pipelining

In the proposed I-ALU design, synthesis results have shown that without any

level of pipelining, the lower bound module and the upper bound module form the

critical path in the design. Figure 4.11 shows the critical path in these modules.

The diagram shows that the critical path traverses some control logic, a multiplier,

followed by some more combinational logic and then an adder. The huge path forces

the design to work at lower clock frequencies. The main portion of the logic in this

critical path is that of the multiplier. There would be a significant decrease in the

clock period if this portion of the logic were to be pipelined. Implementation of a

Figure 4.11: Critical Path

4.5.2 Partially Pipelined Design

A partially pipelined architecture refers to replacing the multiplier formed by

combinational logic alone with a pipelined multiplier architecture. The design tool

‘Synopsys’ provides several pipelined multiplier architectures in its library that can be

used [33]. A significant increase in the circuit performance is observed from the use

of these DesignWare IP blocks. However, this improvement in performance comes

at the cost of increased area and power dissipation. Hence, a suitable trade-off

needs to be made to choose the best design. Figure 4.12 shows the architecture of

a non-pipelined multiplier, while Figure 4.13, Figure 4.14, Figure 4.15 and Figure

4.16 show abstract architectures of two-stage, three-stage, four-stage and five-stage

pipelined multipliers, respectively.

As seen in Figure 4.12, the cloud of combinational logic in a non-pipelined

Figure 4.12: Non-Pipelined Multiplier Architecture

Figure 4.13: Two-stage Pipelined Multiplier Architecture

Figure 4.14: Three-stage Pipelined Multiplier Architecture

thereby reducing the critical path. The subsequent figures show that as the number

of stages of pipelining increase, the cloud size between two consecutive registers

decreases. Hence each pipelined multiplier operates at a faster clock than the

previous. However, when one such pipelined multiplier is included as a part of the circuit,

after a certain level of pipelining, the multiplier ceases to be a part of the critical

Figure 4.15: Four-stage Pipelined Multiplier Architecture

Figure 4.16: Five-stage Pipelined Multiplier Architecture

the critical path. Thus, a pipelined multiplier contributes in a big way to improve

the throughput of a system, but to obtain more performance improvement, further

pipelining of the design is required, as discussed in the following section.

4.5.3 Highly Pipelined Design

A highly pipelined design involves the use of a pipelined multiplier which

provides the best results for this design. As results in chapter 5 indicate, the

performance of the ALU saturates for pipelined multipliers with more than 3 stages. All

architectures that employ pipelined multipliers of more than 3 stages operate at

approximately the same clock frequency. This is because after 3 stages, the multiplier

ceases to be a part of the critical path, but the maximum logic depth is formed

by other combinational logic involving multiplexors and adders as shown in Figure

4.11. For further performance enhancement, it is necessary to reduce this cloud of

combinational logic. Thus, improvement in the performance of the circuit would

from using a three stage pipelined multiplier. Figure 4.17 shows the logic diagram

for this highly optimized design.

Figure 4.17: Highly Pipelined Architecture

The figure shows that several registers are included in the critical path to reduce

its length. The combinational logic between any two consecutive registers is reduced

to some control logic or simply an adder. This produces a significant decrease in

the overall clock period of the circuit as would be evident from the synthesis results.

The amount of combinational logic is maximum in multipliers, followed by adders

and then the control logic, amongst the combinational logic blocks used in this

design. Hence it is important to introduce a register in the path prior to the adder

to reduce the clock period. Introduction of this register brings down the clock period

from 17.73 ns to 3.55 ns. However, introduction of two more stages of pipeline in

the control logic further decreases the clock period to 3.25 ns, as shall be seen in

the following section. This improvement in performance from 3.55 ns to 3.25 ns, a

In conclusion, this chapter forms the heart of the work presented in this thesis.

It provides a detailed explanation of the architecture of the overall design. Every

module and its functionality has been explained comprehensively. One of the most

important optimizations for the circuit by way of pipelining has been discussed.

During this course, the role played by multipliers in governing the performance

of the ALU has been mentioned and finally, a highly pipelined architecture which

employs a three stage multiplier will be shown to be the best design based on

synthesis results. The next chapter provides statistical results obtained from various

simulation and synthesis runs. Throughput, area and power dissipation have been

Testing and Results

This chapter provides a comparison of timing, area and power dissipation made

between the different architectures of the I-ALU based on statistical data obtained

from simulation and synthesis results. Based on its significance for DSP

applications, throughput has been considered as the prime performance metric to analyze

the various designs and come up with an optimum solution. Efforts to improve the

throughput of the system have an adverse effect on the area, latency and power

dissipation of each design. Tabulations and graphical aids have been used to show

the comparisons between these metrics for various architectures.

The functionality of the design was verified by running simulations in the Cadence

environment. Verilog HDL was used to capture the behavior of the ALU, while

Synopsys tools were used for synthesis. The 0.18 µm ‘OSU Standard Cell Library’

was used while synthesizing the various modules. Synopsys Design Compiler [34]

was used for timing analysis and Synopsys PrimePower [35] was used to obtain the

5.1 Simulation Results

Verification of the operation was done by running simulations on the design

for 100% code coverage. All possible input combinations were considered and the

results from the simulation runs were compared with the expected values. Three

cases deserve special note: the union of two disjoint sets requires two clock

cycles, the intersection of two disjoint sets results in a null set, and the special case

multiplication requires two clock cycles to obtain the result, as opposed to the one clock cycle

required for all other operations. Figure 5.1 shows these results for the addition,

subtraction and multiplication operations. As seen in the figure, the results of

addition, subtraction and multiplication were obtained after one clock cycle each,

i.e. this design has a latency of one clock cycle. However, for the special case multiplication,

the output was obtained only after two clock cycles. The status line next goes

high, indicating the occurrence of this case and signaling that the actual output

is available in the next clock cycle instead of the current one. Also, as soon as the

acc select line goes high, the ALU goes into the accumulate mode, as can be seen

from the simulation results.
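
The expected values for addition, subtraction and multiplication can be produced by a small software reference model. The following is a minimal sketch in Python, not the testbench used in this work. The names Interval, iadd, isub, imul and is_special_case_mul are hypothetical, and Python integers stand in for fixed-point words, so overflow and rounding are not modelled. The test for the special case multiplication assumes that it is the case in which both operands contain zero in their interior; the text above does not confirm this.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed interval [lo, hi]; integers stand in for fixed-point words."""
    lo: int
    hi: int


def iadd(a: Interval, b: Interval) -> Interval:
    # Lower bounds add to the lower bound, upper bounds to the upper bound.
    return Interval(a.lo + b.lo, a.hi + b.hi)


def isub(a: Interval, b: Interval) -> Interval:
    # The smallest difference pairs a.lo with b.hi, the largest a.hi with b.lo.
    return Interval(a.lo - b.hi, a.hi - b.lo)


def imul(a: Interval, b: Interval) -> Interval:
    # Generic rule: the bounds are the min and max of the four end-point products.
    products = (a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi)
    return Interval(min(products), max(products))


def is_special_case_mul(a: Interval, b: Interval) -> bool:
    # ASSUMPTION (not stated in the text): the two-cycle special case is the one
    # where both operands straddle zero, so each bound needs two products.
    return a.lo < 0 < a.hi and b.lo < 0 < b.hi


if __name__ == "__main__":
    x, y = Interval(-2, 3), Interval(-1, 4)
    print(iadd(x, y), isub(x, y), imul(x, y), is_special_case_mul(x, y))
```

A model of this kind only supplies expected bounds; cycle counts and status lines still have to be checked against the simulation waveforms.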

Figure 5.2 shows the simulation results for the interval union and intersection

operations. Apart from the usual behavior of union and intersection, the two special

cases are worth noting. The union of two disjoint sets results in two output intervals

on two consecutive clock cycles, and this is appropriately indicated by the union

status line. For the case of intersection of two disjoint sets, the empty status line

goes high indicating a null set, in which case the interval output is considered invalid.
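
The two special cases described above can be illustrated with a short behavioral sketch in Python. It is an illustration under stated assumptions rather than the Verilog description used in this work: the function names are hypothetical, a return value of None stands in for the empty status line going high, and a two-element list stands in for the two output intervals delivered on consecutive clock cycles. The order in which the two disjoint intervals are emitted is also an assumption.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # closed interval (lo, hi)


def interval_intersection(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlap of a and b, or None when the sets are disjoint."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    # None models the empty status line: the interval output is invalid.
    return (lo, hi) if lo <= hi else None


def interval_union(a: Interval, b: Interval) -> List[Interval]:
    """Return one interval if a and b overlap, otherwise both inputs."""
    if interval_intersection(a, b) is None:
        # Disjoint operands: two results, one per clock cycle in hardware.
        # ASSUMPTION: they are emitted in operand order.
        return [a, b]
    return [(min(a[0], b[0]), max(a[1], b[1]))]


if __name__ == "__main__":
    print(interval_union((1, 4), (3, 6)))         # overlapping operands
    print(interval_union((1, 2), (5, 6)))         # disjoint operands
    print(interval_intersection((1, 2), (5, 6)))  # disjoint operands
```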

5.2 Synthesis Results

The different architectures described in the previous chapter have been