GUPTE, RUCHIR. Interval Arithmetic Logic Unit for DSP and Control
Applications. (Under the direction of Prof. William W. Edmonson).
There are many applications in the field of digital signal processing (DSP) and
controls that require the user to know how various numerical errors (uncertainty)
affect the result. Interval Arithmetic (IA) eliminates this uncertainty by replacing
non-interval values with intervals. Since most DSPs operate in real-time
environments, fast processors are needed. The goal is to develop a platform in which
interval arithmetic operations are performed at the same computational speed as
present day signal processors.
This thesis proposes a design for an interval based arithmetic logic unit (I-ALU)
whose computational time for implementing interval arithmetic operations is
equivalent to that of many digital signal processors. Many DSP and control applications require
a small subset of arithmetic operations that must be computed efficiently. This
design has two independent modules operating in parallel to calculate the lower bound
and upper bound of the output interval. The functional unit of the ALU performs
the basic fixed-point interval arithmetic operations of addition, subtraction,
multiplication and the interval set operations of union and intersection. In addition, the
ALU is optimized to perform dot products through the multiply-accumulate
instruction. Traditionally, division is not implemented on digital signal processors unless
it can be computed with a shift operation. In this design, division by shifting is implemented.
One of the prime design goals is to maximize the throughput of the ALU for
an optimum value of area. Pipelining is implemented to achieve this design goal.
A power dissipation analysis of the different ALU architectures is also performed. Since the aim is
to obtain maximum throughput for the least power dissipation, throughput per
unit power dissipation is used as the most critical performance metric. This thesis
studies several architectures for the ALU and concludes with the one with the
by
Ruchir Gupte
A thesis submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Master of Science
Electrical and Computer Engineering
Raleigh
2006
Approved By:
Dr. Winser E. Alexander Dr. William Rhett Davis
Dr. William W. Edmonson
Ruchir Gupte was born on December 9th, 1982 in Mumbai, India. He received
his Bachelor of Engineering (B.E.) degree in Electronics and Telecommunications
from the Mumbai University in June 2004. In Fall 2004, he began his graduate
studies in the Electrical and Computer Engineering Department at North Carolina
State University. Since Spring 2005, he has been working in the High Performance
Digital Signal Processing (HiPerDSP) Laboratory of Dr. Winser Alexander and Dr.
William Edmonson in the field of hardware support for interval analysis.
He worked at Sony Ericsson Mobile Communications Inc., Raleigh, as an intern
from May 2005 to August 2005. He has also taken keen interest in community
participation and has been a committee member of the NC State Indian Graduate
Student Association called MAITRI. Moreover, he has extended his support in
volunteering for various events organized by the Department of Electrical and Computer
Above all, I thank my parents for the much needed motivation throughout the
duration of my stay away from home. It was their love and support that helped
me maintain sanity during stressful times. My sister, Sheetal, has been a great
inspiration for me throughout.
I sincerely acknowledge the efforts of Dr. William Edmonson, my academic
advisor, in providing guidance and encouragement for the successful completion of this
thesis. Dr. Edmonson has made available all resources that I could possibly need
and also allowed the independence of applying my ideas in this project. I am deeply
indebted to him for his patience and invaluable suggestions during the course of this
project.
I am also grateful to the other members of my thesis committee, Dr. Winser
Alexander and Dr. Rhett Davis for devoting their time and providing useful inputs.
Completion of this work would not have been possible without their guidance.
I sincerely wish to express my gratitude to the HiPer DSP Research group for
creating an environment that has been fabulous for research and fun. Additional
thanks to Ramsey Hourani and Senanu Ocloo for their unconditional help
throughout my stay in the group. The encouragement and moral support extended by all
members of the group through good and hard times cannot be described in words.
Special thanks to Ravi Jenkal for his inputs and help.
Finally, I would like to thank those near and dear to me, without whose backing,
this thesis would have been a distant reality. I am grateful to all my friends at
Raleigh for being there, a special mention to my roommate, Karan Tewari, for his
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Background
  1.3 Contribution
  1.4 Thesis Organization
2 Interval Arithmetic
  2.1 Interval Arithmetic and Set Operations
3 Design Specifications of the ALU
  3.1 Fixed-point two’s complement arithmetic
    3.1.1 Representation of numbers
    3.1.2 Arithmetic Operations
  3.2 Outward/Directed Rounding
4 Hardware Architecture
  4.1 Overall Architecture
  4.2 Flag Generator Module
  4.3 Lower Bound and Upper Bound Modules
    4.3.1 Functional Units and Control Logic
    4.3.2 Special Case Multiplication Block
    4.3.3 Multiply-Accumulate Block
  4.4 Rounding Unit
  4.5 Pipeline Architecture of the Design
    4.5.1 Need for Pipelining
    4.5.3 Highly Pipelined Design
5 Testing and Results
  5.1 Simulation Results
  5.2 Synthesis Results
    5.2.1 Non-Pipelined Architecture
    5.2.2 Design with Pipeline Multipliers
    5.2.3 Highly Pipelined Design
  5.3 Power Analysis
    5.3.1 Generating Input Vectors
    5.3.2 Statistical Results from Power Scripts
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
List of Tables
2.1 Nine Cases in Multiplication
3.1 Two’s complement fixed-point representation
4.1 Description of ALU Inputs
4.2 Description of ALU outputs
4.3 mul flag for the Multiplication operation
4.4 Flag Generation
5.1 Timing Reports for the Non-Pipelined Architecture
5.2 Area Reports of Non-Pipelined Architecture
5.3 Timing Reports for the Pipelined Multipliers
5.4 Timing Reports for various Pipelined Architectures
5.5 Area Reports for various Pipelined Architectures
5.6 Timing Reports for Non-Pipelined and Highly-Pipelined Architectures
5.7 Results for Non-Pipelined and Highly-Pipelined Architectures
5.8 Area Reports for Non-Pipelined and Highly-Pipelined Architectures
5.9 Power Dissipation for Different Architectures with 500 Input Vectors
5.10 Power Dissipation for 3-stage Pipelined Architecture
5.11 Power Dissipation for All Stages with different Input Vectors
5.12 Throughput per unit Power Dissipation for All Architectures
List of Figures
3.1 Two’s complement number representation
4.1 Top Level Block Diagram of the ALU
4.2 Flag Generation Module
4.3 Block Diagram of Lower Bound Module
4.4 Block Diagram of Upper Bound Module
4.5 Lower Bound Module
4.6 Upper Bound Module
4.7 Special Case Multiplication
4.8 Multiply-Accumulate Module
4.9 Lower Bound Rounding
4.10 Upper Bound Rounding
4.11 Critical Path
4.12 Non-Pipelined Multiplier Architecture
4.13 Two-stage Pipelined Multiplier Architecture
4.14 Three-stage Pipelined Multiplier Architecture
4.15 Four-stage Pipelined Multiplier Architecture
4.16 Five-stage Pipelined Multiplier Architecture
4.17 Highly Pipelined Architecture
5.1 Simulation Results for Add, Subtract and Multiply
5.2 Simulation Results for Interval Union and Intersection
5.3 Timing Reports of Non-Pipelined Architecture
5.4 Area Reports for the Non-Pipelined Architecture
5.5 Timing Reports of Different Pipelined Architecture
5.6 Area Reports of Different Pipelined Architecture
5.7 Timing and Area Reports of Different Pipelined Architectures
5.11 Power Dissipation for different Input Vectors for 3-stage Pipelined Multiplier Design
5.12 Power Dissipation for different Input Vectors for All Architectures
Introduction
Interval Arithmetic (IA) performs computations on intervals of real numbers
instead of real numbers themselves. It takes into account the numerical errors that
occur due to performing arithmetic on a computer. This is a problem that occurs
on all computers that make use of binary number systems, such as the IEEE 754
Standard for Binary Floating-Point Number Systems [1]. As of now,
implementation of interval arithmetic is performed in software. The GNU Fortran Compiler
has been modified to provide support for an interval data type [2], based on the
Interval Arithmetic Specification [3]. The SUN Studio Fortran95 compiler provides
support for interval operations [4]. SUN Studio has compilers and tools that
support C and C++ development as well [5]. The main disadvantage of software
implementations is their slow speed. They incur tremendous overhead due to reasons
such as changing of rounding modes, function calls, exception handling, memory
management, etc. For instance, the operation of multiplication requires several
conditional branches to determine which interval end-points need to be multiplied.
Based on the values of the input intervals relative to zero, nine different cases of
multiplication have to be accounted for to select the end-points. A large number
of conditional statements are required to select between these multiplication cases.
time consuming steps. The performance penalty to be paid for misprediction of
conditional branches is quite heavy in the case of fully pipelined processors. Changing
of rounding modes in software also requires a large number of computational cycles.
On many processors, changing the rounding mode causes the entire floating-point
pipeline to be flushed, which results in a delay of several cycles and severely limits
parallel execution. Furthermore, software implementations of interval
multiplication are typically implemented as subroutines, which adds overhead for subroutine
calls and returns.
Thus, interval algorithms end up running slower on current computer
architectures compared to their real arithmetic counterparts [6]. Software implementations
are found to be as much as four times slower than functionally equivalent hardware
[7]. Hardware support is required to overcome these performance drops caused by
the above software issues.
Applications of digital signal processing (DSP) involve a very large number of
arithmetic operations, and the necessity of obtaining accurate results makes it
imperative to perform reliable numerical computations. The goal of this design is to
solve problems in the DSP field with higher accuracy and at a faster rate. Since
software implementations are slow, it is necessary to build dedicated hardware in
order to achieve this goal. Interval methods form one of the solutions to reduce errors
resulting from numerical computations. This was the motivation behind building
an Interval ALU dedicated to DSP and Control applications.
Interval based algorithms continue to find applications as the solution for signal
processing and controls problems. For instance, in signal processing, there is usually
the need to determine the optimal solution to a problem, i.e., to minimize a cost
function. The ability of interval global optimization approaches to guarantee
convergence to global minimum point(s) is one that makes such approaches attractive
in DSP and control applications. Having optimum hardware for global
mentioned earlier. DSP and control algorithms need to be designed in such a way
that roundoff and truncation errors that occur naturally due to the discrete nature
of computing do not prevent the algorithm from converging to the global minimum.
Interval analysis provides a means of managing such errors. It is therefore possible
to obtain numerically accurate and reliable results. Reliable results may be defined
as solutions in which the obtained interval is guaranteed to contain the exact result
of the operation being performed. Interval analysis gives an interval as the output
which certainly contains the exact value of the result expected from an operation
performed on two input interval numbers. It is thus capable of providing reliable
results.
These results can be achieved by using arithmetic logic units (ALU) that are
specially designed to manipulate interval numbers. Such an Interval ALU (I-ALU)
can be used as the core of a digital signal processor. In contrast to general
purpose microprocessors that are designed to handle general computing tasks, digital
signal processors are designed and optimized to operate on algorithms that are
characterized by repetitive multiply-and-add operations. In general, they feature
fast multiply-accumulate instructions, multiple-access memory, specialized program
control for interrupt handling and I/O, and fast and efficient access to peripherals.
We desire to achieve maximum efficiency while providing these features that make
up a good digital signal processor.
Throughput is the most important metric to analyze the performance of a DSP
system. The throughput problem will have to be solved for interval algorithms to
become more practical. The throughput of an I-ALU will have to be comparable to
that of non-interval units. Pipelining provides an effective solution to improving the
throughput of the design. By definition, pipelining is an implementation technique
where multiple instructions are overlapped in execution. Each stage completes a
part of an instruction in parallel. Pipelining does not decrease the time for individual
instruction execution. It increases instruction throughput, instead. Throughput of
depth of the pipeline adversely affects the area of the design. Hence an optimum
design would involve a proper trade-off between the throughput and area, where
the throughput would have more importance in signal processing applications.
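The effect of pipelining on latency and throughput described above can be illustrated with a small sketch. It is a minimal model under stated assumptions, not a description of the I-ALU itself: the pipeline is assumed to be ideal, all stages are assumed to have equal delay, and register overhead is ignored. The function name `pipeline_timing` and the delay figures are hypothetical.

```python
def pipeline_timing(n_ops, total_delay_ns, stages):
    """Idealised timing of n_ops operations through a pipeline.

    Assumptions (not taken from the thesis): equal stage delays and
    no overhead from pipeline registers.
    """
    stage_delay = total_delay_ns / stages          # clock period of the pipeline
    latency = stage_delay * stages                 # time for one operation
    total_time = (stages + n_ops - 1) * stage_delay  # fill the pipe, then one result per clock
    throughput = n_ops / total_time                # operations per nanosecond
    return latency, total_time, throughput


# Hypothetical figures: a 10 ns datapath, 1000 operations.
flat = pipeline_timing(1000, 10.0, stages=1)
piped = pipeline_timing(1000, 10.0, stages=5)
```

With these assumed figures, the latency of a single operation stays at 10 ns in both cases, while the time to complete 1000 operations falls from 10000 ns to 2008 ns for the five-stage pipeline.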
1.1 Motivation
Digital Signal Processing has become the choice for many applications related
to communications, control, multimedia, etc. because of the high performance it
achieves for applications that use a limited instruction set to implement
repetitive linear operations such as addition, multiplication and delay on a stream of
sampled data. Often a DSP has been used as an attached coprocessor or combined
with one or more FPGA devices to meet the performance and cost requirements
for a particular application. Common to DSPs is the ability to perform
multiply-accumulate (MAC) operations in a single instruction cycle. This operation is key to
performing vector products, which in turn are key to computing Fourier transforms and
correlation. Other features of a DSP include the ability to access multiple memories,
dedicated address units for simultaneous access to data memories and program
memory, and modulo addressing. Several DSP manufacturers also include specialized
peripherals along with fast interrupt handling. Examples of these specialized peripherals
include analog-digital converters and I/O for multiprocessor communications.
Underlying many of these applications is the need for accurate and reliable
results, but errors due to rounding, uncertainty of the data, quantization noise and
catastrophic cancellation in floating point computations can lead to inaccuracies.
Sometimes these inaccuracies can go unnoticed. For many applications in signal
processing, operations are recursive and act on a sequence of data. This implies
that numerical errors can grow without bound over time. An efficient method for
monitoring and controlling these inaccuracies is to replace point arithmetic with interval
Digital signal processors represent one of the fastest growing segments of the
embedded world. Despite their vast use, DSPs present difficult challenges for
programmers. Since computation speed is of critical importance to DSP applications,
DSPs focus on supporting fixed-point operations.
Use of fixed-point representation not only requires programmers to deal with
mathematically sophisticated equipment, but also requires them to deal with errors
that are introduced due to the use of reduced-precision arithmetic. Although it would
be ideal to use floating-point arithmetic over fixed point, as a practical
consideration, fixed-point processors operate at a much faster rate than their floating-point
counterparts. Fixed-point DSPs execute in the gigahertz range; floating-point DSPs
peak out in the 300-400 megahertz range. Fixed-point DSPs enjoy another
advantage: they are consumed in large volumes, as a result of which their price per chip is a
fraction of the price of floating-point DSPs [8]. Fixed-point processors gain speed
and power efficiency over floating-point processors at the cost of reduced precision.
However, DSP applications rarely require the full dynamic range offered by the
floating-point number system. This justifies the choice of fixed-point arithmetic for
our ALU design over floating-point arithmetic.
As mentioned earlier, hardware solutions are needed over software implementation
to solve the speed problem. This brings up the idea of building an Arithmetic Logic
Unit dedicated to perform arithmetic operations on interval inputs in the fixed
point representation. The aim is to develop hardware that is optimized for interval
operations pertinent to signal processing applications such as addition, subtraction,
multiplication and multiply-accumulate.
1.2 Background
Interval algorithms have found their usage in applications such as global
solving differential equations [14], solving non-linear equations [15], etc. In most of
these cases, interval arithmetic is used to solve problems which cannot be efficiently
solved using conventional floating-point arithmetic.
Several software tools have been developed to support interval arithmetic. These
include interval arithmetic libraries in Fortran [16], [17], [18], C++ [19], extended
scientific programming languages such as PASCAL [20], C++ [21], Fortran [22],
[23] and interval-enhanced compilers [24]. In spite of these developments in the field,
interval arithmetic has not gained popularity owing to its speed disadvantage when
compared to conventional floating point methods. It is believed that the performance
of interval arithmetic needs to be within a factor of five of floating-point arithmetic
for it to gain general acceptance [25]. Hardware support for interval arithmetic is
required to achieve this high performance. Several interval based hardware designs
have been implemented, a few of which have been listed below:
• Hardware Interval Multipliers [26]
The author presents serial and parallel hardware units for interval
multiplication. These units provide automatic interval end-point selection and correct
rounding of results. While the serial interval multiplier uses a single multiplier
unit, the parallel multiplier uses dual multipliers to compute the interval
end-points simultaneously. These multipliers provide a significant performance
boost for acceptable increases in area.
• A Combined Interval and Floating Point Multiplier [27]
This design is based on the approach that an interval multiplier can share
hardware with an existing floating point multiplier, thereby achieving the
performance benefits of an interval multiplier at relatively low cost. The design
resorts to software solutions to handle the uncommon case of multiplication
where both input intervals contain zero. Interval multiplication requires only one
more cycle than floating point multiplication, and is one to two orders of
• A Combined Interval and Floating Point Divider [28]
This design follows a similar approach as above, where an existing floating
point divider is modified to enable interval division on it. Based on the values
of interval inputs relative to zero, seven different cases of division are
addressed. Interval division can be performed after modifying the floating point
divider with a 24% increase in area.
• A Combined Interval and Floating Point Comparator [29]
This design is an implementation of a combined interval and floating-point
comparator, which performs interval intersection, hull, minimum, maximum
and comparisons, as well as floating-point minimum, maximum and
comparisons. It has around 98% more area than conventional floating-point
comparators and a worst case delay that is 42% greater.
• Variable Precision Interval Arithmetic Processors [30]
The author presents designs, arithmetic algorithms and software support for
a family of variable precision, interval arithmetic processors. These processors
give the programmer the ability to detect, and if desired, to correct the implicit
errors in finite precision numerical computations. The processors are two to
three orders of magnitude faster than software packages that provide similar
functionality.
However, all of the above architectures are designed for floating-point
representation of numbers. Although these are high-precision computational units, they
have lower throughput than their potential fixed-point counterparts. As mentioned
earlier, they also have higher design complexity and hence are undesirable for DSP
applications.
1.3 Contribution
The following thesis designs the hardware architecture of an I-ALU and optimizes
two’s complement representation of numbers. Two’s complement representation
is most convenient to perform arithmetic because of its uniformity over positive
and negative numbers while performing operations and rounding. Although
fixed-point arithmetic reduces the precision of results, the precision provided by it is
sufficient for DSP related applications. Besides, it has the added advantage of
reduced complexity of the design and higher throughput. A basic hardware model
has been built at the RTL level of abstraction, and the design has been modified
for better efficiency by use of pipelined multipliers of increasing depths. These
designs have been explored and statistical data, based on the results of simulations
and synthesis, has been used to determine the optimal solution. Throughput,
area, power dissipation and numerical reliability are the performance metrics used
for system evaluation.
1.4 Thesis Organization
Chapter 2 introduces the reader to the concept and conventional representation
of Interval Numbers. Various arithmetic and set operations which can be performed
on these numbers by the I-ALU are discussed in detail in this chapter. Chapter 3
provides the design specifications of the proposed ALU. The significance of using
two’s complement representation of numbers can be seen here along with the details
of the rounding issue. Chapter 4 describes in detail the hardware architecture of
the ALU. A comprehensive description of each module constituting the ALU is
given there, and the issue of rounding is addressed. Chapter 5 provides
the results of simulation runs and synthesis performed on different architectures of
the design. An exhaustive comparison of the results from the non-pipelined design
and various versions of the pipelined design is carried out to arrive at an optimal
solution. Throughput, area and power dissipation are used as the performance
metrics. Finally, I conclude my work with Chapter 6 providing the details of the future
work that can be done on the design to broaden its scope, improve functionality
Interval Arithmetic
In the words of Ramon E. Moore [31], “If we have, in addition to the results of
a computation, error bounds for the differences between the results and the exact
solution values, then no matter how these error values were obtained, by analytical
means or by further machine computations during or after the given computation,
it will always be the case that we have, in effect, for each exact result sought, a pair
of numbers: an approximate value and an error bound, or an upper and a lower
bound to the exact result.”
Real numbers may require infinite precision, whereas all machines are inherently finite
precision. Consequently, real numbers are approximated to bring them into
machine-representable form. The resulting error bound may be considered an uncertainty. Interval
Analysis is a means of representing this uncertainty by replacing single (fixed-point)
values by intervals. It provides a means of bounding the errors that accrue due to
the discrete nature of computing.
An interval number is defined to be an ordered pair of real numbers [a, b], such
that a ≤ b. Using the notation {x | P(x)} for “the set of x such that the proposition
P(x) holds”, we can write
[a, b] = {x | a ≤ x ≤ b}, where x ∈ ℝ
Using this convention, real numbers are represented as intervals with identical
upper and lower bounds. Such intervals are called “Degenerate Intervals” and
have the form [a, a]. The usual operations of addition, subtraction,
multiplication, and division that are possible with real numbers are also defined for interval
numbers [32].
An interval number is also a set of real numbers. The interval number [a, b]
is a set of real numbers x such that a ≤ x ≤ b. Hence, set operations of union
and intersection can also be performed on interval numbers. Section 2.1 discusses in
depth the various arithmetic and set operations performed by the proposed ALU.
In interval arithmetic, the true result is guaranteed to lie within the resulting
interval. This is achieved by the Outward Rounding algorithm. Outward rounding
on an interval X = [a, b] is achieved by rounding the lower bound, a, to the largest
machine-representable number smaller than a, and the upper bound, b, to the
smallest machine-representable number larger than b. This involves the use of the round
down and round up modes on the lower and upper bounds, respectively. Directed
Rounding capabilities, that is, the ability to round down or round up, have been
available since the Intel 8087 chip. As a result, interval arithmetic is possible on
virtually any computer.
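Outward rounding onto a fixed-point grid can be sketched as follows. This is only an illustration under stated assumptions, not the rounding unit of the proposed ALU: the helper name `round_outward` and the parameter `frac_bits` (the number of fractional bits) are hypothetical, exact rational inputs are assumed, and a bound that is already representable is left unchanged rather than moved to the next grid point.

```python
import math
from fractions import Fraction


def round_outward(lower, upper, frac_bits):
    """Round [lower, upper] outward onto a grid with step 2**-frac_bits."""
    scale = 1 << frac_bits
    lo_q = math.floor(Fraction(lower) * scale)   # round down (toward -infinity)
    hi_q = math.ceil(Fraction(upper) * scale)    # round up (toward +infinity)
    return Fraction(lo_q, scale), Fraction(hi_q, scale)


# Hypothetical example: enclose [1/3, 2/3] using 4 fractional bits.
lo, hi = round_outward(Fraction(1, 3), Fraction(2, 3), frac_bits=4)
```

In this example the result is [5/16, 11/16], which contains the original interval [1/3, 2/3], so the true value cannot fall outside the rounded bounds.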
2.1 Interval Arithmetic and Set Operations
As described in the previous section, interval numbers are represented by an
ordered pair [a, b] such that a ≤ b. The interval arithmetic operations of addition,
subtraction and multiplication are discussed in this section. The rules for
the set operations of union and intersection, along with the calculation of the width and
mid-point of a single interval, are also described.
In the following discussion, we consider two input interval numbers. They are
represented as [xL, xU] and [yL, yU], where xL and yL are the lower bounds and xU
and yU are the upper bounds of the two intervals. Except for one special case of set
union of two disjoint sets, all operations result in a single output interval number.
The outputs of various interval operations are obtained as follows:
• Addition
Addition of interval numbers is a straightforward operation where the lower
bound of the output interval is obtained from the sum of the lower bounds of
the input intervals, while the upper bound of the output interval is obtained
from the sum of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU] + [yL, yU] = [xL + yL, xU + yU]
• Subtraction
In subtraction, the lower bound of the output interval is obtained by subtracting
the upper bound of the second interval number from the lower bound of the first
interval number. Similarly, the upper bound of the output interval is obtained
by subtracting the lower bound of the second interval number from the upper
bound of the first interval number.
Mathematically, this can be represented as:
[xL, xU] − [yL, yU] = [xL − yU, xU − yL]
Table 2.1: Nine Cases in Multiplication
Case  Condition                  Result
1     xL ≥ 0; yL ≥ 0             [xLyL, xUyU]
2     xL ≥ 0; yL < 0 < yU        [xUyL, xUyU]
3     xL ≥ 0; yU ≤ 0             [xUyL, xLyU]
4     xL < 0 < xU; yL ≥ 0        [xLyU, xUyU]
5     xL < 0 < xU; yU ≤ 0        [xUyL, xLyL]
6     xU ≤ 0; yL ≥ 0             [xLyU, xUyL]
7     xU ≤ 0; yL < 0 < yU        [xLyU, xLyL]
8     xU ≤ 0; yU ≤ 0             [xUyU, xLyL]
9     xL < 0 < xU; yL < 0 < yU   [min(xUyL, xLyU), max(xLyL, xUyU)]
• Multiplication
Multiplication presents a more difficult problem than addition and
subtraction. Unlike these two operations, the sign of the operands, and not
only their magnitude, needs to be taken into consideration. Both the sign and the
magnitude of the operands decide which two values are to be multiplied to obtain the
lower and upper bounds separately. In general, the result
of multiplication of two input intervals is obtained as follows:
If [xL, xU] ∗ [yL, yU] = [zL, zU], then,
zL = min(xLyL, xLyU, xUyL, xUyU) and
zU = max(xLyL, xLyU, xUyL, xUyU)
These computations require eight multiplications and several comparisons
to be performed before the lower and upper bounds of the intervals can be
obtained. This makes the multiplication operation highly inefficient. To
overcome this problem, the multiplication operation is split into 9 different cases
based on the values of the operands with respect to zero. Table 2.1 lists these cases.
From this table, it can be observed that the task of obtaining the lower
bound and the upper bound of the output interval is reduced to two
multiplications compared to the eight multiplications that were required when a brute
force method was followed. Comparisons of the input values need to be done
initially to determine which case they belong to. Reduction in the number of
multiplications required to be done to determine the output helps in making
the design more hardware efficient.
A special mention needs to be made of case 9, where both the input intervals
include zero in them. Although this would be a rare case in high resolution
processors, it needs to be addressed for the purpose of numerical reliability. As
can be seen from the table, the output for this case requires 4 multiplications
and 2 comparisons to be performed. This case leads to increased complexity
in the design and from the hardware point of view requires double the amount
of computational time as compared to other operations.
• Union of Interval Numbers
Union of interval numbers is done in the same way as the union operation
in set theory. By definition, for two sets A and B, (A ∪ B) is defined as a
set containing all elements of set A and all elements of set B. Similarly for
interval numbers, to perform the union operation, the lower bound is obtained
by determining the minimum value of the lower bounds of the two input
intervals while the upper bound is obtained by determining the maximum
value of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU] ∪ [yL, yU] = [min(xL, yL), max(xU, yU)]
For interval numbers, the union of two disjoint sets has to be dealt with
separately, the two output intervals being exactly equal to each of the input intervals. Amongst all
operations performed by the ALU, this is the only operation which results in
two output intervals.
Mathematically, this can be represented as:
For two disjoint intervals, [xL, xU] and [yL, yU],
[xL, xU] ∪ [yL, yU] = [xL, xU] + [yL, yU]
where + here denotes the pair of two separate output intervals, not interval addition.
• Intersection of Interval Numbers
Intersection of interval numbers is done in the same way as the intersection
operation in set theory. By definition, for two sets A and B, (A ∩ B) is
defined as a set containing only those elements that belong to set A and to set
B. For the intersection operation, the lower bound is obtained by determining
the maximum value of the lower bounds of the two input intervals while the
upper bound is obtained by determining the minimum value of the upper
bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU]∩[yL, yU] = [max(xL, yL), min(xU, yU)]
A null set is obtained for the intersection of two disjoint sets.
• Width
The “width” operation is performed on a single interval. Width of an interval
is defined as the difference between the upper bound and lower bound of the
interval. The output is naturally a single value.
width[xL, xU] = xU − xL
• Mid-point
The “mid-point” operation is also performed on a single interval. Mid-point
of an interval is obtained by taking the average of the lower bound and upper
bound of the input interval. Once again, the output is a single value.
Mathematically, this can be represented as:
midpoint[xL, xU] = (xU+xL)/2
Division by 2 is performed by right shifting the sum of the two bounds of the
interval by one bit. Sign extension by one bit also needs to be done.
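The set operations, width and mid-point described above can be sketched in software as follows. This is a minimal Python illustration and not the hardware design itself; the function names and the representation of an interval as a (lower, upper) tuple of integers are assumptions made for the example.

```python
# Illustrative model only: an interval is a (lower, upper) tuple of integers,
# for example raw fixed-point words. All names here are hypothetical.

def is_disjoint(x, y):
    """True when the two intervals share no point."""
    return x[1] < y[0] or y[1] < x[0]

def union(x, y):
    """Return a list with one interval, or two when the inputs are disjoint."""
    if is_disjoint(x, y):
        return [x, y]
    return [(min(x[0], y[0]), max(x[1], y[1]))]

def intersection(x, y):
    """Return the common interval, or None for the null set."""
    if is_disjoint(x, y):
        return None
    return (max(x[0], y[0]), min(x[1], y[1]))

def width(x):
    """Upper bound minus lower bound."""
    return x[1] - x[0]

def midpoint(x):
    """Average of the bounds, with division by 2 done as a right shift.

    Python's >> on integers is an arithmetic (sign-preserving) shift,
    which mirrors a sign-extended shift in two's complement.
    """
    return (x[0] + x[1]) >> 1
```

For instance, `union((1, 2), (4, 6))` returns both input intervals unchanged, which corresponds to the two-output case noted above, while `intersection((1, 2), (4, 6))` returns `None` for the null set.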
The operations described above will be implemented on the proposed I-ALU.
Unique to this work will be the fact that these operations in conjunction with
Design Specifications of the ALU
The ALU is based on a parallel architecture where computation of the lower
bound and the upper bound of the output interval is simultaneously done. The
design is built for fixed point operation using the two’s complement representation
of numbers. Fixed-point two’s complement interval arithmetic and rounding are
described in detail in this section.
3.1
Fixed-point two’s complement arithmetic
The main focus of this design is to build a fixed-point interval arithmetic and
logic unit, as opposed to the floating-point interval units that have been designed
previously and are discussed in brief in section 1.2. To this end, it is important to be
familiar with the operations performed on fixed-point numbers. This section is an
in-depth study of the working of fixed-point arithmetic. It explains the functionality of
the three basic operations, viz. addition, subtraction and multiplication, performed
on fixed-point numbers. Given that our design is oriented towards DSP related
applications, division is not performed in hardware. Division by powers of 2 is
based on the application for which the ALU is going to be used once we have a proper
understanding of the working of fixed-point arithmetic. Irrespective of being a real
number ALU or an ALU for Interval Arithmetic, the logic behind the mathematics
that is being performed remains the same.
As the most generalized case, two’s complement format for representing the
fixed-point numbers has been used. It accounts for operations performed on both positive
and negative numbers.
3.1.1
Representation of numbers
In the binary number system, an N-bit word represents integer values from 0
to 2^N − 1. This is referred to as the unsigned integer representation. The
fixed-point representation has a predefined position of the radix point, which means that
we have a fixed number of bits reserved for the integer part and a fixed number for
the fractional part. A 32-bit number having 16 bits reserved for the integer part and 16
for the fractional part is represented as 16:16. However, this mode lacks the ability
to represent negative numbers.
The two’s complement method of representing fixed-point numbers accounts for both
positive and negative numbers. The MSB of this fixed-point number indicates the
sign (referred to as the sign-bit), whereas the rest of the bits define the magnitude of
the number. Figure 3.1 shows the structure of an N-bit signed number in two’s
complement format as used in this implementation. The range of numbers represented
Figure 3.1: Two’s complement number representation (N-bit word, bits 0 to N−1: sign bit followed by fraction)
Table 3.1: Two’s complement fixed-point representation
Binary Two’s Complement    Decimal Equivalent
00010110.11000000          22.75
11101001.01000000          -22.75
00001000.00100000          8.125
11110111.11100000          -8.125
of numbers greatly simplifies the hardware implementation of the arithmetic being
performed. Table 3.1 provides a few examples of 16 bit two’s complement fixed-point
numbers in the 8:8 format.
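The 8:8 encodings in Table 3.1 can be reproduced with a short sketch. The helper names `to_fixed`, `from_fixed` and `show` are hypothetical and are introduced only for this illustration; words are held as unsigned Python integers that stand for the 16-bit patterns.

```python
FRAC_BITS = 8    # 8:8 format: 8 integer bits and 8 fractional bits
WORD_BITS = 16

def to_fixed(value, frac_bits=FRAC_BITS, word_bits=WORD_BITS):
    """Encode a real value as a two's complement bit pattern."""
    scaled = int(round(value * (1 << frac_bits)))
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    if not lowest <= scaled <= highest:
        raise OverflowError("value does not fit in the word")
    return scaled & ((1 << word_bits) - 1)

def from_fixed(word, frac_bits=FRAC_BITS, word_bits=WORD_BITS):
    """Decode a two's complement bit pattern back to a real value."""
    if word & (1 << (word_bits - 1)):     # sign bit set: negative number
        word -= 1 << word_bits
    return word / (1 << frac_bits)

def show(word, frac_bits=FRAC_BITS, word_bits=WORD_BITS):
    """Format the word as integer_bits.fraction_bits."""
    bits = format(word, "0{}b".format(word_bits))
    split = word_bits - frac_bits
    return bits[:split] + "." + bits[split:]
```

Calling `show(to_fixed(-22.75))` yields the second row of the table, and `from_fixed` recovers the decimal value from the pattern.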
3.1.2
Arithmetic Operations
This section provides examples of various operations performed on two’s
complement fixed-point numbers. The three basic operations of addition, subtraction
and multiplication are considered. Let us go through each of these operations one at a
time.
• Addition
Addition involves simple addition of bits when the numbers are represented in
two’s complement form. The two operands are sign extended from 16 bits to
17 bits and the 17th bit of the result of addition is then sign extended to obtain
the 32 bit output. The following examples illustrate the addition operation:
1. 22.75 + (-8.125) = 14.625
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are added to the right of the number to perform sign
extension of the decimal part. Hence 00001110.10100000 in the 8:8 format
is represented as 0000000000001110.1010000000000000 in the 16:16 format.
2. (-8.125) + (-8.125) = (-16.25)
1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
1 1 1 1 0 1 1 1 1 . 1 1 0 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are added to the right of the number to perform sign
extension of the decimal part. Hence 11101111.11000000 in the 8:8 format
is represented as
1111111111101111.1100000000000000 in the 16:16 format.
Thus, in terms of hardware, the 16 bit number needs to be sign extended to
17 bits and the 17th bit of the result needs to be used for sign extension.
• Subtraction
Rules similar to those followed in addition apply while performing
subtraction of numbers in the two’s complement form. The only change is that
we need to take the two’s complement of the number to be subtracted and
then add it to the other number, which is in its two’s complement form.
The remaining procedure remains unchanged. The following
examples illustrate the subtraction operation:
1. 22.75 - 8.125 = 14.625
22.75 in two’s complement form is represented as 00010110.11000000.
8.125 in two’s complement form is represented as 00001000.00100000.
Its two’s complement is 11110111.11100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are added to the right of the number to perform sign
extension of the decimal part. This gives us the desired result of the
subtraction 14.625.
2. 8.125 - 22.75 = (-14.625)
8.125 in two’s complement form is represented as 00001000.00100000.
22.75 in two’s complement form is represented as 00010110.11000000.
Its two’s complement is 11101001.01000000. Hence,
0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
+ 1 1 1 1 0 1 0 0 1 . 0 1 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 . 0 1 1 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are added to the right of the number to perform sign
extension of the decimal part. This gives us the desired result of the
subtraction -14.625.
3. 22.75 - (-8.125) = 30.875
22.75 in two’s complement form is represented as 00010110.11000000.
-8.125 in two’s complement form is represented as 11110111.11100000.
Its two’s complement is 00001000.00100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
0 0 0 0 1 1 1 1 0 . 1 1 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are added to the right of the number to perform sign
extension of the decimal part. Thus, we obtain the desired result 30.875.
• Multiplication
The issue of sign extension is not involved in multiplication. Multiplication of
two 16:16 numbers will result in a 32:32 number. In my examples, I consider
numbers of the 4:4 format, whose product is in the 8:8 format. The following
examples illustrate the multiplication operation:
1. 1.25 ∗ 3.25 = 4.0625
0001.0100 ∗ 0011.0100 = 00000100.00010000
2. 7.9375 ∗ 7.9375 = 63.00390625
0111.1111 ∗ 0111.1111 = 00111111.00000001
If either of the operands is a negative number, we have to first take the
two’s complement of that number and then perform the usual multiplication
as explained above. The sign of the result will depend on the number of two’s
complements that we have taken before performing the multiplication. In
hardware, this is achieved by doing the exclusive-OR of the two sign bits.
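The three operations above can be mirrored in a short sketch that works on bit patterns. It assumes 16-bit operands in the 8:8 format and 32-bit results in the 16:16 format (the multiplication examples above use the smaller 4:4 format instead). All function names are hypothetical, and the code is an illustration of the described procedure rather than the actual hardware.

```python
# Hypothetical helpers; words are unsigned Python integers used as bit patterns.
MASK16 = 0xFFFF
MASK17 = 0x1FFFF
MASK32 = 0xFFFFFFFF

def sext17(word):
    """Sign extend a 16-bit pattern to 17 bits by copying bit 15 into bit 16."""
    return word | 0x10000 if word & 0x8000 else word

def widen(sum17):
    """Turn a 17-bit sum into a 16:16 word.

    The 17th bit is copied into 7 more bits on the left and 8 zeros are
    appended on the right.
    """
    if sum17 & 0x10000:
        sum17 |= 0xFE0000
    return (sum17 << 8) & MASK32

def add_8_8(a, b):
    """Add two 8:8 words and return the 16:16 result."""
    return widen((sext17(a) + sext17(b)) & MASK17)

def sub_8_8(a, b):
    """Subtract by adding the two's complement of the subtrahend."""
    neg_b = (~sext17(b) + 1) & MASK17
    return widen((sext17(a) + neg_b) & MASK17)

def mul_8_8(a, b):
    """Multiply the magnitudes, then set the sign from the XOR of the sign bits."""
    negative = ((a ^ b) >> 15) & 1
    mag_a = (~a + 1) & MASK16 if a & 0x8000 else a
    mag_b = (~b + 1) & MASK16 if b & 0x8000 else b
    product = mag_a * mag_b          # 8:8 times 8:8 is already in 16:16
    return (~product + 1) & MASK32 if negative else product
```

As a usage example, `add_8_8(0x16C0, 0xF7E0)` adds the patterns for 22.75 and -8.125 and gives `0x000EA000`, the 16:16 pattern for 14.625 shown in the first addition example.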
Addition, subtraction and multiplication are three very important operations
performed by the I-ALU. The functionality of all three operations has been
described in detail in this section. This study goes a long way in determining the
hardware architecture of the system. Besides these operations, multiply-accumulate
forms the heart of any DSP processor. Special emphasis has been laid on this in
the following sections. Division by numbers other than powers of 2 occurs
less often in DSP related applications. Division is implemented only by the
shift operation because of the cost of a divider with respect to time and area.
Section 3.2 addresses the important issue of rounding.
3.2
Outward/Directed Rounding
For most systems, although the internal buses of an ALU may be wide enough,
they have fixed sized registers. Input to this system is 16-bit with an internal bus
for them to be stored in these smaller sized registers. This reduction in word length
is achieved by the rounding operation. The bits of lower significance of the output
are suitably discarded depending on the rounding direction of the operand. This
introduces errors called precision rounding errors. However, Interval Arithmetic
makes sure that the exact result of the operation lies within the output interval.
Provision is made in this system to round the output interval values from 32 bits to
either 24 bits or 16 bits depending on an input provided. This provision is made to keep the design flexible for different applications.
As discussed earlier, the proposed system is based on two’s complement
fixed-point number representation. Two’s complement number representation greatly
simplifies the rounding algorithm because an identical procedure is
followed for rounding up positive and negative numbers. Likewise, a single (but different)
procedure is followed for rounding down positive and negative numbers. The IEEE
standard defines four rounding modes, viz. round towards nearest, round towards
zero, round towards positive infinity and round towards negative infinity. When
applying these to Interval Arithmetic, we are concerned with two cases: rounding towards
positive infinity and rounding towards negative infinity. The algorithms for
rounding towards positive infinity and negative infinity are explained below with suitable
examples.
• Rounding towards negative infinity.
Rounding towards negative infinity refers to denoting a high precision number
by the greatest machine representable number of low precision but smaller in
value. In fixed-point two’s complement representation, this is achieved by
simply discarding the bits of lower significance. This algorithm holds true for
positive and negative numbers as illustrated by the following examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1100, which is 6.75.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0011, which is -6.8125.
• Rounding towards positive infinity.
Rounding towards positive infinity refers to denoting a high precision number
by the smallest machine representable number of low precision but greater
in value. In fixed-point two’s complement representation, this is achieved by
performing a logical ‘OR’ on the bits of lower significance to be discarded and
then adding this bit to the number to be retained. Once again, this algorithm
holds true for positive and negative numbers as illustrated by the following
examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1101, which is 6.8125.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0100, which is -6.75.
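The two rounding rules can be written out as a small sketch that works directly on two's complement bit patterns. The function names and the `width`/`drop` parameters are assumptions for the illustration; `drop` is the number of low-order bits removed (4 for the 8:8 to 8:4 examples above).

```python
def round_down(word, width, drop):
    """Round toward negative infinity: discard the `drop` low-order bits."""
    assert 0 <= word < (1 << width)
    return word >> drop

def round_up(word, width, drop):
    """Round toward positive infinity.

    The discarded bits are ORed into a single bit, which is added to the
    retained bits. A carry out of the top bit is dropped, so a negative
    value just below zero rounds up to zero. Overflow of the most positive
    value is not handled in this sketch.
    """
    assert 0 <= word < (1 << width)
    sticky = 1 if word & ((1 << drop) - 1) else 0
    return ((word >> drop) + sticky) & ((1 << (width - drop)) - 1)

def outward_round(lower, upper, width, drop):
    """Round the lower bound down and the upper bound up."""
    return round_down(lower, width, drop), round_up(upper, width, drop)
```

With `width=16` and `drop=4`, `round_up(0b1111100100111000, 16, 4)` gives `0b111110010100`, which is the -6.75 result of the negative example above. The same functions cover the 32-bit to 24-bit and 32-bit to 16-bit cases with `width=32` and `drop=8` or `drop=16`.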
The above examples can be used for rounding of 32 bit fixed-point numbers in the
16:16 format to 24 bit fixed-point numbers in the 16:8 format. A similar procedure
is followed if the output has to be reduced to 16 bits from 32 bits. These examples
cover all aspects of the rounding algorithm. It is called the “Outward Rounding”
or “Directed Rounding” algorithm and is responsible for the validation of results
provided by interval analysis. The study of this algorithm makes it very simple to
design the hardware to implement outward rounding. The proposed design
has a separate rounding unit which takes inputs from the functional units and
provides the outputs of the system.
After getting acquainted with the design specifications of the ALU, I now proceed
Hardware Architecture
This chapter of the thesis contains a description of all the modules that constitute
the Interval-ALU. It gives details of the logic design at the gate level for the whole
system, one module at a time. The hardware model at the RTL level of abstraction
is built from these logic designs. Since throughput is the main performance metric
to be optimized, the logic is designed with reduction of the critical path delay in
mind. Several pipelined versions of the design are built along with the basic
non-pipelined one to improve the throughput. I begin with the top level block diagram
of the design and then go into the details of each module.
4.1
Overall Architecture
The overall architecture of the ALU can be seen in the block diagram shown in
Figure 4.1. The hardware model is divided into four parts, viz. the flag generator,
lower bound and upper bound modules, and the rounding unit. The flag generator
module is responsible for generating the control signals for the more complicated
multiplication operation. As the names suggest, the lower bound module and the
upper bound module compute the lower bound and the upper bound of the output interval,
respectively. These two modules are independent of each other and hence operate in
parallel. The rounding unit implements the Outward Rounding algorithm explained
earlier.
Figure 4.1: Top Level Block Diagram of the ALU
The ALU is designed for operation on 16 bit input interval numbers in the two’s
complement form. The ALU has an input line that allows selection of the
multiply-accumulate mode, acc select. The ALU operates in the accumulate mode as long
as this line is held high. Another input line, rctl, determines the number of output
bits. The output is rounded to 24 bits when this line is held high and 16 bits when
this line is held low. Table 4.1 lists all the inputs to the ALU.
The ALU has two 24-bit output lines that represent the lower and upper bounds
of the resulting interval. Besides this, there are output lines to highlight the special
Table 4.1: Description of ALU Inputs
Input        Description                                Bit Width
xL           Lower bound on left-hand operand           16 bits
xU           Upper bound on left-hand operand           16 bits
yL           Lower bound on right-hand operand          16 bits
yU           Upper bound on right-hand operand          16 bits
command      Mathematical operation to be performed     3 bits
acc select   Perform MAC when asserted                  1 bit
rctl         Width of output results (16 or 24-bits)    1 bit
two disjoint sets. A further explanation of these output lines is provided in the
following sections. Table 4.2 lists all the outputs of the ALU.
Table 4.2: Description of ALU outputs
Output   Description                      Bit Width
zL       Lower bound on result            24 bits
zU       Upper bound on result            24 bits
next     Valid results on output lines    1 bit
union    Union of disjoint sets           1 bit
empty    Intersection of disjoint sets    1 bit
4.2
Flag Generator Module
A major significance of the design is the reduction in the number of
multiplications performed to evaluate the result of an interval multiplication operation. The
flag generator module forms the control logic for the multiplication operation.
Based on the values of the input operands, it generates a 4 bit mul flag
which selects among the nine cases. Table 4.3 shows the case to be selected based
on the value on the mul flag.
Table 4.3: mul flag for the Multiplication operation
mul     Case                         Result
0001    xL ≥ 0; yL ≥ 0               [xLyL, xUyU]
0010    xL ≥ 0; yL < 0 < yU          [xUyL, xUyU]
0011    xL ≥ 0; yU ≤ 0               [xUyL, xLyU]
0100    xL < 0 < xU; yL ≥ 0          [xLyU, xUyU]
0101    xL < 0 < xU; yU ≤ 0          [xUyL, xLyL]
0110    xU ≤ 0; yL ≥ 0               [xLyU, xUyL]
0111    xU ≤ 0; yL < 0 < yU          [xLyU, xLyL]
1000    xU ≤ 0; yU ≤ 0               [xUyU, xLyL]
0000    xL < 0 < xU; yL < 0 < yU     [min(xUyL, xLyU), max(xLyL, xUyU)]
The logic behind the generation of this flag is shown in Figure 4.2. Table 4.4
explains the generation of the various flag signals used in Figure 4.2. As shown in
the table, the flag signals are generated based on the values of the input operands.
These flags are used to obtain the mul signal based on which the inputs to the
multipliers in the main functional units are selected. This reduces the number of
multiplications to be performed.
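The case selection can be illustrated with a short software sketch. The function names are hypothetical, the bounds are plain signed integers, and the encoding of the flag follows the case codes listed for the mul flag. For the case in which both intervals are non-positive, the sketch takes xL·yL as the upper bound, since that is the largest of the four products in that case. Only the operand pairs selected by the flag are multiplied, which is the saving the flag generator is meant to provide.

```python
def mul_flag(xL, xU, yL, yU):
    """Return the 4-bit case code for an interval multiplication."""
    x_pos, x_neg = xL >= 0, xU <= 0      # sign conditions on the x interval
    y_pos, y_neg = yL >= 0, yU <= 0      # sign conditions on the y interval
    if x_pos:
        return 0b0001 if y_pos else (0b0011 if y_neg else 0b0010)
    if x_neg:
        return 0b0110 if y_pos else (0b1000 if y_neg else 0b0111)
    # x contains zero in its interior
    return 0b0100 if y_pos else (0b0101 if y_neg else 0b0000)

def interval_mul(xL, xU, yL, yU):
    """Multiply two intervals using only the products chosen by the flag."""
    flag = mul_flag(xL, xU, yL, yU)
    if flag == 0b0000:
        # special case: two products and one comparison per bound
        return (min(xU * yL, xL * yU), max(xL * yL, xU * yU))
    select = {
        0b0001: ((xL, yL), (xU, yU)),
        0b0010: ((xU, yL), (xU, yU)),
        0b0011: ((xU, yL), (xL, yU)),
        0b0100: ((xL, yU), (xU, yU)),
        0b0101: ((xU, yL), (xL, yL)),
        0b0110: ((xL, yU), (xU, yL)),
        0b0111: ((xL, yU), (xL, yL)),
        0b1000: ((xU, yU), (xL, yL)),
    }
    (a, b), (c, d) = select[flag]
    return (a * b, c * d)
```

A degenerate interval such as [0, 0] satisfies more than one sign condition; the sketch simply takes the first matching branch, and every branch gives the same result in that situation.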
4.3
Lower Bound and Upper Bound Modules
The lower bound and the upper bound modules have very similar hardware
structures. The primary difference is the inputs that drive its functional units.
Table 4.4: Flag Generation
Condition                  Flag Generated
xL ≥ 0                     flag 1
yL ≥ 0                     flag 2
xU ≤ 0                     flag 3
yU ≤ 0                     flag 4
(xL < 0) && (xU > 0)       flag 5
(yL < 0) && (yU > 0)       flag 6
Figure 4.2: Flag Generation Module
is concentrated in them. Both modules are independent of each other’s operation
which makes the working of the ALU dependent only on the input signals and not
perform the accumulate operation. This is an important feature of the ALU because
dot product which is implemented through the multiply-accumulate operation forms
the core requirement of any DSP processor. Figure 4.3 and Figure 4.4 display the
basic block diagram of each of the two modules. As seen in the block diagrams, the
two modules are very similar in architecture. However, it is important to note the
different status lines generated by different portions of the modules.
Figure 4.3: Block Diagram of Lower Bound Module
Figure 4.4: Block Diagram of Upper Bound Module
In a non-pipelined architecture, the circuit performance is determined by these
modules of the design. Critical path of a design may be defined as the single slowest
feasible path contained in the design. Greater the logic depth, longer is the critical
throughput. Since it forms the critical path, a significant effort needs to be put in
to optimize it. Pipelining provides the ideal solution to improve the throughput
problem and is discussed in detail beginning section 4.5. I will now go through each
of the individual blocks that make up the ALU as shown in Figure .
4.3.1
Functional Units and Control Logic
The combinational logic required to perform arithmetic operations on interval
numbers is located in the functional unit block. Apart from the difference in a few
output status lines, the hardware for functional units in each of the two modules is
identical. Each module has an adder/subtractor, a multiplier and other
combinatorial logic to implement the set operations. Figure 4.5 shows the functional unit in
the lower bound module. It is important to note that the inputs to the adder are
the two lower bounds of the input operands, while the inputs to the subtractor are
the lower bound of the first input operand and the upper bound of the second. The
type of operation and the mul signal are used as controls to determine the outputs
of the functional units. The status line empty is generated by this portion of the
design to indicate the intersection of two disjoint sets.
Figure 4.6 shows the functional unit in the upper bound module. The inputs
to the adder and subtractor for this module differ from those of the lower bound
module. The upper bounds of both the input operands drive the adder, while the
upper bound of the first input operand and the lower bound of the second input
operand drive the subtractor. Once again, the mul flag determines the inputs to the
multiplier and the outputs of the union and intersection set operations. The status
line union is generated by this module which indicates the union of two disjoint
sets.
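The operand wiring of the adder and subtractor in the two modules can be summarized with a small sketch. The two functions stand in for the lower bound and upper bound modules; their names and the string-valued `op` argument are assumptions made for the illustration, and each function uses only its own inputs, in keeping with the independence of the two modules.

```python
def lower_bound_unit(op, xL, xU, yL, yU):
    """Adder and subtractor wiring of the lower bound module."""
    if op == "add":
        return xL + yL        # both lower bounds drive the adder
    if op == "sub":
        return xL - yU        # lower bound of x, upper bound of y
    raise ValueError("unsupported operation: " + op)

def upper_bound_unit(op, xL, xU, yL, yU):
    """Adder and subtractor wiring of the upper bound module."""
    if op == "add":
        return xU + yU        # both upper bounds drive the adder
    if op == "sub":
        return xU - yL        # upper bound of x, lower bound of y
    raise ValueError("unsupported operation: " + op)

def interval_op(op, x, y):
    """Evaluate both bounds; in hardware the two units run in parallel."""
    return (lower_bound_unit(op, *x, *y), upper_bound_unit(op, *x, *y))
```

For example, `interval_op("sub", (1, 2), (3, 5))` returns `(-4, -1)`, which contains every difference of a value in [1, 2] and a value in [3, 5].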
The outputs of the arithmetic units are given to the special case multiplication
Figure 4.5: Lower Bound Module
4.3.2
Special Case Multiplication Block
The situation in which both input operands include zero in their intervals
represents a special case, and is referred to as the ‘Special Case Multiplication’. In
contrast to the normal cases where interval multiplication requires two
Figure 4.6: Upper Bound Module
of the bounds in this special case. Hence we require a memory element which would
store a value and make it available for comparison with the next available value. It
requires two multiplications and one comparison to be performed to obtain each of
the two bounds. The hardware to determine the lower bound and the upper bound
is identical and is repeated in both the modules. A status line next is taken from this
block. Figure 4.7 shows the hardware architecture of this block. As seen in the diagram,
the left/right out c line coming in from the functional unit block is stored as
left/right out r and used for comparison. The result of this comparison is used only for the interval
multiplication operation when the mul flag is 0000. The minimum value is selected for
the lower bound module and the maximum value is selected for the upper bound
module. The special case multiplication may lead to synchronization problems if
not dealt with properly.
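The store-then-compare behavior described above can be modelled with a small cycle-level sketch. The class name, the `clock` method and the use of `None` for "no valid output yet" are assumptions made for the illustration; the stored attribute plays the role of the memory element, and the returned flag plays the role of the next status line.

```python
class SpecialCaseBound:
    """Software model of one module's special case multiplication block."""

    def __init__(self, pick):
        self.pick = pick      # min for the lower bound, max for the upper bound
        self.stored = None    # models the register holding the first product

    def clock(self, product):
        """Accept one product per clock cycle and return (value, next_flag)."""
        if self.stored is None:
            self.stored = product                    # cycle 1: hold the product
            return None, True                        # output not yet valid
        result = self.pick(self.stored, product)     # cycle 2: compare
        self.stored = None
        return result, False


# Example: x = [-2, 3], y = [-4, 5], both of which contain zero.
lower = SpecialCaseBound(min)
upper = SpecialCaseBound(max)
first = (lower.clock(3 * -4), upper.clock(-2 * -4))    # xU*yL and xL*yL
second = (lower.clock(-2 * 5), upper.clock(3 * 5))     # xL*yU and xU*yU
```

In this example the first cycle returns no valid bounds with the flag asserted, and the second cycle returns -12 for the lower bound and 15 for the upper bound.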
Figure 4.7: Special Case Multiplication
4.3.3
Multiply-Accumulate Block
A multiply-accumulator forms a very important part of the ALU, more so
because it is intended for use in DSP applications. DSP applications are characterized
Mathematically, the dot product can be calculated as:
$a \cdot b = \sum_{i=0}^{n} a_i b_i$
This operation can be readily performed by a multiply accumulate block. Figure
4.8 shows the hardware architecture of the multiply-accumulate block. As we can
see in the figure, it consists of an adder and a memory element which acts as the
accumulator. An input line acc select determines whether the output needs to be accumulated or not. When high, the block is in accumulate mode. The output of
this block is 32 bits long and is given to the rounding unit, where outward rounding
is implemented.
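An interval dot product computed through repeated multiply-accumulate steps can be sketched as follows. The class and method names are hypothetical. The sketch assumes that when `acc_select` is low the product is passed through and loaded into the accumulator, which is one possible reading of the description above. The product itself is formed from the minimum and maximum of all four bound products for brevity, not from the reduced case selection used in the hardware.

```python
def interval_mul(x, y):
    """Reference interval product from the four bound products."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(products), max(products))

class IntervalMAC:
    """Adder plus accumulator register, one copy per bound."""

    def __init__(self):
        self.acc = (0, 0)

    def step(self, x, y, acc_select):
        """Multiply x and y, then accumulate if acc_select is high."""
        p = interval_mul(x, y)
        if acc_select:
            self.acc = (self.acc[0] + p[0], self.acc[1] + p[1])
        else:
            self.acc = p      # assumed behavior: load the product
        return self.acc

def interval_dot(a, b):
    """Dot product of two equal-length lists of intervals."""
    mac = IntervalMAC()
    result = (0, 0)
    for i, (x, y) in enumerate(zip(a, b)):
        result = mac.step(x, y, acc_select=(i > 0))
    return result
```

For the vectors `[(1, 2), (-1, 1)]` and `[(3, 4), (2, 2)]`, the first step yields (3, 8) and the accumulated result is (1, 10).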
Figure 4.8: Multiply-Accumulate Module
4.4
Rounding Unit
The rounding unit forms a critical part of the Interval-ALU. This unit performs
the outward rounding, which guarantees the result of a computation to lie within
the output interval. The proposed design has provision to round a 32 bit output
to either 24 bits or 16 bits, depending on the application
for which it is going to be used. An input line, rctl, determines the number of bits to
which the output has to be rounded. When this line is high, the output is rounded
to 24 bits, else it is rounded to 16 bits. The outward rounding algorithm has been
discussed in section 3.2, hence this section concentrates on describing the hardware
to implement these rounding modes. Figure 4.9 shows the architecture for rounding
the output of the lower bound module.
Figure 4.9: Lower Bound Rounding
From the figure we can see that 8 or 16 bits of lower significance are discarded
based on the rctl input, and the higher 24 bits or 16 bits are retained.
The rounding operation for the upper bound module is slightly more complicated
as compared to the lower bound module. In this case, the bits of lower significance
are not simply discarded, but are logically ‘OR’ed and the resultant bit is added to
the bits to be retained. If the rctl line is high, the last 8 bits are logically ‘OR’ed
and added to the 24 bits of high significance; otherwise, the
last 16 bits are logically ‘OR’ed and added to the 16 bits of high significance. Figure
4.10 shows the architecture for rounding the output of the upper bound module.
Figure 4.10: Upper Bound Rounding
This completes the architecture of the entire ALU. From the architecture, it is
safe to say that the lower bound module and the upper bound module are the
critical modules in the design. Maximum logic is concentrated in them and hence it
is important to optimize them to a high degree. The following section concentrates
on optimizing these critical modules to improve the throughput of the system, by
pipelining the design to the highest degree. It provides details of several pipelined
4.5
Pipeline Architecture of the Design
This section presents several versions of the I-ALU, pipelined to various degrees
so as to achieve maximum throughput. Pipelining is a technique used to reduce the
critical path of the circuit and hence improve the speed at which the circuit can
operate. Increase in throughput for any circuit by implementing pipelining comes at
the cost of increased area, increased power dissipation and increased initial latency.
Although the technique of pipelining may appear to have more disadvantages than
advantages, in DSP systems where throughput is of prime importance, it goes a
long way in improving the efficiency of the system. As seen earlier, DSP systems
are characterized by several multiplication and addition instructions. A multiplier
involving combinational logic alone has a very high logic depth and is one of the
main candidates to form the critical path. Hence, to reduce this logic depth, it is
necessary to pipeline the multiplier. However, the benefit of pipelining the multiplier
saturates at a certain point, and it then becomes necessary to pipeline the rest of
the design to improve its efficiency. The following sections discuss
these pipelining techniques in further detail.
4.5.1
Need for Pipelining
In the proposed I-ALU design, synthesis results have shown that without any
level of pipelining, the lower bound module and the upper bound module form the
critical path in the design. Figure 4.11 shows the critical path in these modules.
The diagram shows that the critical path traverses some control logic, a multiplier,
followed by some more combinational logic and then an adder. The huge path forces
the design to work at lower clock frequencies. The main portion of the logic in this
critical path is that of the multiplier. There would be a significant decrease in the
clock period if this portion of the logic were to be pipelined. Implementation of a
Figure 4.11: Critical Path
4.5.2
Partially Pipelined Design
Partially pipelined architecture refers to replacing the multiplier formed by
combinational logic alone with a pipelined multiplier architecture. The design tool
‘Synopsys’ provides several pipelined multiplier architectures in its library that can be
used [33]. A significant increase in the circuit performance is observed from the use
of these DesignWare IP blocks. However, this improvement in performance comes
at the cost of an increased area and power dissipation. Hence, a suitable trade-off
needs to be made to choose the best design. Figure 4.12 shows the architecture of
a non-pipelined multiplier, while Figure 4.13, Figure 4.14, Figure 4.15 and Figure
4.16 show an abstract architecture of two-level, three-level, four-level and five-level
deep pipelined multipliers, respectively.
As seen in Figure 4.12, the cloud of combinational logic in a non-pipelined
Figure 4.12: Non-Pipelined Multiplier Architecture
Figure 4.13: Two-stage Pipelined Multiplier Architecture
Figure 4.14: Three-stage Pipelined Multiplier Architecture
thereby reducing the critical path. The subsequent figures show that as the number
of stages of pipelining increases, the cloud size between two consecutive registers
decreases. Hence each pipelined multiplier operates at a faster clock than the
previous one. However, when one such pipelined multiplier is included as a part of the circuit,
after a certain level of pipelining, the multiplier ceases to be a part of the critical
Figure 4.15: Four-stage Pipelined Multiplier Architecture
Figure 4.16: Five-stage Pipelined Multiplier Architecture
the critical path. Thus, a pipelined multiplier contributes in a big way to improve
the throughput of a system, but to obtain more performance improvement, further
pipelining of the design is required, as discussed in the following section.
4.5.3
Highly Pipelined Design
A highly pipelined design involves the use of a pipelined multiplier which
provides the best results for this design. As results in chapter 5 indicate, the
performance of the ALU saturates for pipelined multipliers with more than 3 stages. All
architectures that employ pipelined multipliers of more than 3 stages operate at
approximately the same clock frequency. This is because after 3 stages, the multiplier
ceases to be a part of the critical path, and the maximum logic depth is formed
by other combinational logic involving multiplexers and adders as shown in Figure
4.11. For further performance enhancement, it is necessary to reduce this cloud of
combinational logic. Thus, improvement in the performance of the circuit would
from using a three stage pipelined multiplier. Figure 4.17 shows the logic diagram
for this highly optimized design.
Figure 4.17: Highly Pipelined Architecture
The figure shows that several registers are included in the critical path to reduce
its length. The combinational logic between any two consecutive registers is reduced
to some control logic or simply an adder. This produces a significant decrease in
the overall clock period of the circuit as would be evident from the synthesis results.
The amount of combinational logic is maximum in multipliers, followed by adders
and then the control logic, amongst the combinational logic blocks used in this
design. Hence it is important to introduce a register in the path prior to the adder
to reduce the clock period. Introduction of this register brings down the clock period
from 17.73 ns to 3.55 ns. However, introduction of two more stages of pipeline in
the control logic further decreases the clock period to 3.25 ns, as shall be seen in
the following section. This improvement in performance from 3.55 ns to 3.25 ns, a
In conclusion, this chapter forms the heart of the work presented in this thesis.
It provides a detailed explanation of the architecture of the overall design. Every
module and its functionality has been explained comprehensively. One of the most
important optimizations for the circuit by way of pipelining has been discussed.
During this course, the role played by multipliers in governing the performance
of the ALU has been mentioned and finally, a highly pipelined architecture which
employs a three stage multiplier will be shown to be the best design based on
synthesis results. The next chapter provides statistical results obtained from various
simulation and synthesis runs. Throughput, area and power dissipation have been
Testing and Results
This chapter provides a comparison of timing, area and power dissipation
between the different architectures of the I-ALU based on statistical data obtained
from simulation and synthesis results. Based on its significance for DSP
applications, throughput has been considered as the prime performance metric to analyze
the various designs and come up with an optimum solution. Efforts to improve the
throughput of the system have an adverse effect on the area, latency and power
dissipation of each design. Tabulations and graphical aids have been used to show
the comparisons between these metrics for various architectures.
The functionality of the design was verified by running simulations in the Cadence
environment. Verilog HDL was used to capture the behavior of the ALU while
Synopsys was used for synthesis purposes. The 0.18µm ‘OSU Standard Cell Library’ was used while synthesizing the various modules. Synopsys Design Compiler [34]
was used for timing analysis and Synopsys PrimePower [35] was used to obtain the
5.1
Simulation Results
Verification of the operation was done by running simulations on the design
for 100% code coverage. All possible input combinations were considered and the
results from the simulation runs were compared with the expected values. A special
note needs to be made of three cases: the union of two disjoint sets requires two clock
cycles, the intersection of two disjoint sets results in a null set, and the special case
multiplication requires two clock cycles to obtain the result as against one clock cycle
required for all other operations. Figure 5.1 shows these results for the addition,
subtraction and multiplication operations. As seen in the figure, the results of
addition, subtraction and multiplication were obtained after one clock cycle each
i.e. this design has a latency of 1. However, for the special case multiplication,
the output was obtained only after two clock cycles. The status line next goes
high indicating the occurrence of this case and informing that the actual output
is available in the next clock cycle instead of the current. Also, as soon as the
acc select line goes high, the ALU goes into the accumulate mode as can be seen from the simulation results.
Figure 5.2 shows the simulation results for the interval union and intersection
operations. Apart from the usual behavior of union and intersection, the two special
cases are worth noting. The union of two disjoint sets results in two output intervals
on two consecutive clock cycles, and this is appropriately indicated by the union
status line. For the case of intersection of two disjoint sets, the empty status line
goes high indicating a null set, in which case the interval output is considered invalid.
5.2
Synthesis Results
The different architectures described in the previous chapter have been