CHAPTER 1 INTRODUCTION

1.1 BACKGROUND

Fast Fourier transform (FFT) processors have greatly increased the effectiveness of digital technology. The main objective of the project is to develop an area-efficient FFT processor using radix 4 butterfly computations that employ a multiplier-less distributed arithmetic algorithm (DAA) based on a look-up table (LUT). A comparison between the radix 2 and radix 4 FFT processors in terms of various device utilization factors is tabulated.

Altera Quartus II 9.1 software is used to analyze the timing, power consumption and device utilization of the proposed architecture. The simulated design is then implemented on the EP2C35F672C6 FPGA board.

1.2 THE DISCRETE FOURIER TRANSFORM

The discrete Fourier transform (DFT) converts a finite sequence of equally spaced samples of a function into a list of coefficients. These coefficients weight a combination of complex sinusoids, ordered by frequency, that has those same sample values. The sampled function is thereby converted from the time domain to the frequency domain. The DFT of a finite-duration or periodic discrete-time N-point signal {x[n], 0 ≤ n ≤ N − 1} is defined as,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \tag{1.1}$$

Fig 1.1 DFT Terminology

In other terms,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N-1$$

where $W_N = e^{-j 2\pi / N} = \cos(2\pi/N) - j \sin(2\pi/N)$.

Equation (1.1) is known as the N-point DFT analysis equation. A sequence of N complex numbers is transformed into an N-periodic sequence of complex numbers. In fig 1.1, the time domain signal x[ ] consists of N points running from 0 to N−1. In the frequency domain, the DFT produces two signals, the real part, written as Re X[ ], and the imaginary part, written as Im X[ ], together termed X[k]. For a real-valued input, each of these frequency domain signals needs to be specified only from 0 to N/2, because the remaining values follow by symmetry.
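As an illustration of equation (1.1), the direct evaluation can be sketched in a few lines of Python. This is only a reference sketch, not part of the proposed hardware; the function name `dft` is chosen here for illustration.

```python
import cmath

def dft(x):
    """Direct N-point DFT of the sequence x, evaluated term by term as in equation (1.1)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

# A unit impulse has a flat spectrum; a constant sequence has energy only at k = 0.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```

Each output value costs N complex multiplications, so the whole transform costs N² of them, which is the figure the FFT improves on.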

1.3 THE FAST FOURIER TRANSFORM

Several methods are available for computing the DFT, such as the simultaneous equation method and the correlation method. A far more efficient one is the Cooley-Tukey algorithm, named after J. W. Cooley and John Tukey. It is one of the most common fast Fourier transform (FFT) algorithms.

It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2. Applied recursively, this reduces the computation time to O(N log N) for highly composite N. The FFT computes the DFT and produces exactly the same result; the most important difference is that it is much faster than direct evaluation of the DFT.

It is observed from equation (1.1) that, for each value of k, direct computation of X(k) involves N complex multiplications (4N real multiplications) and N−1 complex additions (4N−2 real additions in total, counting those inside the complex multiplications). Consequently, computing all N values of the DFT requires N² complex multiplications and N²−N complex additions. Direct computation of the DFT is inefficient primarily because it does not exploit the symmetry and periodicity properties of the phase factor W_N.

Symmetry property:

$$W_N^{k+N/2} = -W_N^{k} \tag{1.2}$$

Periodicity property:

$$W_N^{k+N} = W_N^{k} \tag{1.3}$$

To achieve a dramatic increase in efficiency, the DFT computation is decomposed into successively smaller DFT computations. In this process, both the symmetry and the periodicity of the complex exponential are exploited (eq. 1.2, eq. 1.3).
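The two properties can be checked numerically with a short Python sketch. It assumes the convention W_N = exp(−j2π/N); the helper names `W` and `check_properties` are illustrative only.

```python
import cmath

def W(N, k):
    """Twiddle factor W_N^k, assuming W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)

def check_properties(N, tol=1e-12):
    """Return (symmetry holds, periodicity holds) for all k in 0..N-1."""
    symmetry = all(abs(W(N, k + N // 2) + W(N, k)) < tol for k in range(N))
    periodicity = all(abs(W(N, k + N) - W(N, k)) < tol for k in range(N))
    return symmetry, periodicity

result = check_properties(16)
```

With this convention W_N^(N/4) = −j, which is where the factors ±j in the radix 4 butterflies come from.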

There are several types of FFT algorithms; the most commonly used are decimation in time (DIT) and decimation in frequency (DIF).

1.3.1 DECIMATION IN TIME

Decimation in time (DIT) is the algorithm in which the input sequence x(n) is decomposed into successively smaller interleaved subsequences (even and odd samples for radix 2, every fourth sample for radix 4). The principle of decimation in time is illustrated here for radix 4, where N is a power of 4.

Fig 1.2 Decimation in time (DIT) butterfly diagram

The principal difference between DIT and DIF butterflies is the difference in the order of calculation. In the DIF algorithm, the time domain data is twiddled before the sub-transforms are performed. In DIT, however, the sub-transforms are performed first, and the output is obtained by twiddling the resulting frequency domain data (fig 1.2). Considering an N-point signal x[n], the basic DFT equation is,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$

$$X(k) = \sum_{n=0}^{N/4-1} x(4n)\, W_N^{4nk} + \sum_{n=0}^{N/4-1} x(4n+1)\, W_N^{(4n+1)k} + \sum_{n=0}^{N/4-1} x(4n+2)\, W_N^{(4n+2)k} + \sum_{n=0}^{N/4-1} x(4n+3)\, W_N^{(4n+3)k} \tag{1.4}$$

where k = 0, 1, 2, 3, ..., N − 1.

Each of the sums is recognized as an N/4-point DFT. The derivation of the DIT radix-4 FFT is done by splitting the sums into subsequent smaller indexes (eq. 1.4).

Although the index k ranges over N values, k = 0, 1, 2, ..., N−1 (eq. 1.4), each of the sums needs to be computed only for k = 0, 1, 2, ..., N/4−1, since they are periodic with period N/4. The transform X(k) is broken into four parts as shown below,

$$\begin{aligned}
X(k) &= \sum_{n=0}^{N/4-1} x(4n)\, W_{N/4}^{nk} + W_N^{k} \sum_{n=0}^{N/4-1} x(4n+1)\, W_{N/4}^{nk} + W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2)\, W_{N/4}^{nk} + W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3)\, W_{N/4}^{nk} \\
X(k + N/4) &= \sum_{n=0}^{N/4-1} x(4n)\, W_{N/4}^{nk} - j W_N^{k} \sum_{n=0}^{N/4-1} x(4n+1)\, W_{N/4}^{nk} - W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2)\, W_{N/4}^{nk} + j W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3)\, W_{N/4}^{nk} \\
X(k + 2N/4) &= \sum_{n=0}^{N/4-1} x(4n)\, W_{N/4}^{nk} - W_N^{k} \sum_{n=0}^{N/4-1} x(4n+1)\, W_{N/4}^{nk} + W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2)\, W_{N/4}^{nk} - W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3)\, W_{N/4}^{nk} \\
X(k + 3N/4) &= \sum_{n=0}^{N/4-1} x(4n)\, W_{N/4}^{nk} + j W_N^{k} \sum_{n=0}^{N/4-1} x(4n+1)\, W_{N/4}^{nk} - W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2)\, W_{N/4}^{nk} - j W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3)\, W_{N/4}^{nk}
\end{aligned} \tag{1.5}$$

where k = 0 to N/4 − 1.

The inputs to the DIT have to be given in bit reversed order, and the outputs are in normal order (fig 1.3) after performing the FFT. The DIT calculation reduces the number of complex multiplications needed for an N-point DFT from N² to 3(N/4) log₄N.
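The four-way split of the DIT derivation can be sketched recursively in Python. This is a software illustration under the assumption that N is a power of 4 and W_N = exp(−j2π/N); the function name is hypothetical, and the recursion takes the interleaved subsequences by slicing, so no explicit input reordering appears.

```python
import cmath

def fft_radix4_dit(x):
    """Recursive radix 4 decimation-in-time FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    # N/4-point DFTs of the interleaved subsequences x(4n), x(4n+1), x(4n+2), x(4n+3)
    F0, F1, F2, F3 = (fft_radix4_dit(x[r::4]) for r in range(4))
    q = N // 4
    X = [0j] * N
    for k in range(q):
        w = cmath.exp(-2j * cmath.pi * k / N)   # W_N^k
        a = F0[k]
        b = w * F1[k]
        c = w * w * F2[k]
        d = w * w * w * F3[k]
        # The four parts of the transform, combined with factors 1, -j, -1, j
        X[k] = a + b + c + d
        X[k + q] = a - 1j * b - c + 1j * d
        X[k + 2 * q] = a - b + c - d
        X[k + 3 * q] = a + 1j * b - c - 1j * d
    return X

spectrum = fft_radix4_dit([complex(n, 0) for n in range(16)])
```

Only the three products with w, w² and w³ per butterfly are general complex multiplications; the factors ±1 and ±j reduce to sign changes and swaps of real and imaginary parts.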

Fig 1.3 Flow chart of DIT

1.3.2 DECIMATION IN FREQUENCY

In the decimation in frequency (DIF) algorithm, the output sequence X(k) is divided into successively smaller subsequences. The twiddle factor is multiplied only after the butterfly additions and subtractions (fig 1.4). For the derivation, the input sequence is divided into four consecutive blocks of N/4 points each. The DIF format is followed for the FFT processor design.

Fig 1.4 Decimation in frequency (DIF) butterfly diagram

The radix 4 DIF-FFT algorithm decomposes the N-point DFT calculation into a number of 4-point DFTs (4 point butterflies).

Compared with direct computation of the N-point DFT, the 4-point butterfly calculation requires far fewer operations. The basic DFT is given by equation (1.6). The radix 4 DIF-FFT can be derived as shown in equation (1.7).

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{1.6}$$

$$X(k) = \sum_{n=0}^{N/4-1} x(n)\, W_N^{nk} + \sum_{n=0}^{N/4-1} x(n + N/4)\, W_N^{(n+N/4)k} + \sum_{n=0}^{N/4-1} x(n + 2N/4)\, W_N^{(n+2N/4)k} + \sum_{n=0}^{N/4-1} x(n + 3N/4)\, W_N^{(n+3N/4)k} \tag{1.7}$$

where k = 0, 1, 2, 3, ..., N − 1.

Each sum in equation (1.7) is very similar to an N/4-point FFT. However, it is not an FFT of length N/4 because the twiddle factor depends on N instead of N/4. To make this equation an N/4-point FFT, the transform X(k) is broken into four parts as shown below,

$$\begin{aligned}
X(4k) &= \sum_{n=0}^{N/4-1} \{x(n) + x(n+N/4) + x(n+2N/4) + x(n+3N/4)\}\, W_N^{0}\, W_{N/4}^{nk} \\
X(4k+1) &= \sum_{n=0}^{N/4-1} \{x(n) - j\,x(n+N/4) - x(n+2N/4) + j\,x(n+3N/4)\}\, W_N^{n}\, W_{N/4}^{nk} \\
X(4k+2) &= \sum_{n=0}^{N/4-1} \{x(n) - x(n+N/4) + x(n+2N/4) - x(n+3N/4)\}\, W_N^{2n}\, W_{N/4}^{nk} \\
X(4k+3) &= \sum_{n=0}^{N/4-1} \{x(n) + j\,x(n+N/4) - x(n+2N/4) - j\,x(n+3N/4)\}\, W_N^{3n}\, W_{N/4}^{nk}
\end{aligned} \tag{1.8}$$

where k = 0 to N/4 − 1.

It is observed that X(4k), X(4k+1), X(4k+2), and X(4k+3) are N/4-point FFTs of the four twiddled sequences in equation (1.8), denoted y(n), y(n+N/4), y(n+2N/4), and y(n+3N/4), respectively.

The inputs are given in normal order, unlike in the DIT process; thus DIF is preferred in this project (fig 1.5). As in the DIT process, in the radix 4 DIF butterfly algorithm the N-point FFT consists of log₄N stages, and each stage consists of N/4 radix 4 DIF butterflies. The radix 4 DIF butterfly calculation reduces the number of complex multiplications needed for an N-point DFT from N² to 3(N/4) log₄N (from 4N² to 3N log₄N in terms of real multiplications).
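The DIF decomposition can be sketched in the same way. The Python function below is an illustrative recursion, assuming N is a power of 4 and W_N = exp(−j2π/N): it forms the four twiddled sequences first and then transforms each with an N/4-point FFT. The name `fft_radix4_dif` is not taken from the project.

```python
import cmath

def fft_radix4_dif(x):
    """Recursive radix 4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    y = [[0j] * q for _ in range(4)]
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        w = cmath.exp(-2j * cmath.pi * n / N)            # W_N^n
        y[0][n] = a + b + c + d                          # W_N^0 = 1, no multiplication
        y[1][n] = (a - 1j * b - c + 1j * d) * w
        y[2][n] = (a - b + c - d) * w ** 2
        y[3][n] = (a + 1j * b - c - 1j * d) * w ** 3
    X = [0j] * N
    for r in range(4):
        X[r::4] = fft_radix4_dif(y[r])                   # fills X(4k + r)
    return X

spectrum = fft_radix4_dif([1, 0, 0, 0] * 4)
```

The recursion writes each sub-transform back to positions 4k + r, so the result is in natural order; an in-place hardware flow leaves the outputs in reversed-digit order instead.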

Fig 1.5 Flow chart of DIF

1.4 APPLICATION OF FAST FOURIER TRANSFORM

Fast Fourier transforms are widely used in many applications in engineering, science, and mathematics. The basic ideas were popularized in 1965 by Cooley and Tukey. The uses of FFT methods include spectral analysis, signal processing, Fourier spectroscopy and image processing.

The importance of the FFT derives from the fact that, in signal processing and image processing, it makes working in the frequency domain as computationally feasible as working in the temporal or spatial domain. There are real time digital Fourier methods, for which special purpose computers are now being developed.

The FFT is an integral part of orthogonal frequency division multiplexing (OFDM) based wireless systems. In an OFDM system, a very high rate data stream is divided into multiple parallel low rate data streams. Each smaller data stream is then mapped to an individual data subcarrier and modulated. The modulation of the subcarriers is performed by the FFT algorithm, whose length depends on the number of subcarriers used in the wireless standard. In WiMAX, for example, the FFT length varies from 64 to 2048.

1.5 LITERATURE SURVEY

Chu Yu and Mao-Hsu Yen proposed a 128/256/512/1024/1536/2048 point FFT architecture using a hardware sharing mechanism. The operation is based on processing elements (butterfly units), delay lines, buffers of various sizes and complex multipliers. The hardware sharing mechanism reduces memory usage by using otherwise unused space. The concept of designing the butterfly units has been analyzed in detail and used in this project. The flow of inputs and outputs for the proposed processor follows that studied in the literature. Although the hardware sharing mechanism saves memory, it increases the latency of the processor. Hence the concept of parallel processing of data was introduced.

Tanvir Ahmed, Mario Garrido, and Oscar Gustafsson proposed a 512 point radix 8 parallel pipelined feedforward FFT architecture. The FFT is parameterizable in word length, which can be selected according to the application. The number of complex multiplications, complex additions and buffers is reduced significantly by adopting a parallel architecture for the FFT. Performing arithmetic calculations simultaneously decreases the time delay.

The idea of the parallel mechanism is adapted from this paper to radix 4. The parallel mechanism helps to complete the processing within a single clock cycle, which further increases the speed of the FFT processor.

M. Rawski, M. Wojtyski, T. Wojciechowski and P. Majkowski presented a multiplier-less FFT processor design using the distributed arithmetic algorithm (DAA). The complete DAA concept based on offset binary coding is studied from this paper. In this project, a look-up table is used with DAA with the intention of decreasing the area. DAA is used to combine the twiddle factors with the inputs when generating the FFT outputs.

1.6 ORGANISATION OF THE PROJECT

Chapter 2 describes in detail the distributed arithmetic algorithm (DAA) used in the output calculation process. The flow chart and the derivation of the look-up table (LUT) based DAA are presented.

Chapter 3 presents the basics of the radix 4 algorithm and the floating point computation using the binary scaling technique. The methodology and working of the processor are explained using radix 4 butterfly diagrams for 16 and 64 points. The complete block diagram of the proposed processors is specified.

Chapter 4 deals with the experimental results obtained by implementing the proposed architecture on the specified hardware. A comparison of FFT processors implemented with various algorithms is made.

Chapters 5 and 6 give the overall conclusion of the project along with the future scope. The latter lists the journals, books and websites referred to.

CHAPTER 2 DISTRIBUTED ARITHMETIC ALGORITHM

2.1 INTRODUCTION

Distributed arithmetic is basically a computational algorithm that performs multiplication with the help of a look-up table (LUT). It plays a vital role in embedding digital signal processing (DSP) functions in FPGA devices. The distributed arithmetic algorithm specifically targets the sum of products (also known as the vector dot product).

2.2 FLOW DIAGRAM

The distributed arithmetic algorithm (DAA) comprises three integral parts, namely a shift register, an accumulator and a look-up table. The diagrammatic representation of the DAA unit is given in fig 2.1.

Fig 2.1 DAA flow diagram

The input data is fetched into the processor, after which its product with the twiddle factor is retrieved from the look-up table, where the pre-computed values are stored. At each shift, the look-up table output is applied to a parallel adder, whose output is stored in an accumulator register.

The scaled accumulator output is the second input to the adder. The adder, register and scaler are therefore referred to together as a scaling accumulator unit. The multiplication is thus replaced by successive additions and look-up table accesses.

2.3 LOOK UP TABLE

The arithmetic operations have now been reduced to addition, subtraction and binary scaling. Scaling by a power of 2 is implemented by shifting the binary coded data: towards the most significant bit (MSB) for positive powers, and towards the least significant bit (LSB) for negative powers, in which case the bits shifted out below the LSB are discarded.

The distributed arithmetic algorithm can be numerically represented as

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \tag{2.1}$$

where

y = output of the FFT computed from K samples
A_k = twiddle factor for the input (constant)
b_kn = bit n of the k-th input sample (variable)

In equation (2.1), the constants A_k are the coefficients and the variables b_kn are the bits of the prior samples of a single data source in a filtering application. In the case of a frequency transformation, whether discrete Fourier or fast Fourier transform, the constants are the sine/cosine basis functions (twiddle factors) and the variables are a block of samples from a single data source (inputs). The distributed arithmetic algorithm (DAA) is basically a bit-level rearrangement of the multiply and accumulate operation.

It is an efficient technique for calculating the sum of products, vector dot product, inner product, and multiply and accumulate (MAC) values. MAC operation is very common in all digital signal processing algorithms.
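The bit-level rearrangement of the multiply and accumulate operation can be sketched in Python as follows. This is a behavioural sketch under stated assumptions, not the hardware description used in the project: the inputs are signed integers holding N-bit two's complement fractions (x_k = integer / 2^(N−1)), the look-up table stores every possible sum of the constants A_k, and the names `build_lut`, `address` and `da_dot` are illustrative.

```python
def build_lut(A):
    """LUT[addr] = sum of the constants A_k whose address bit k is 1 (2**K entries)."""
    K = len(A)
    return [sum(A[k] for k in range(K) if (addr >> k) & 1) for addr in range(1 << K)]

def address(xs, n, nbits):
    """Collect bit n (n = 0 is the sign bit) of every input word into a LUT address."""
    addr = 0
    for k, x in enumerate(xs):
        addr |= ((x >> (nbits - 1 - n)) & 1) << k
    return addr

def da_dot(lut, xs, nbits):
    """Sum of products y = sum_k A_k * x_k using only look-ups, additions and halving."""
    acc = 0.0
    for n in range(nbits - 1, 0, -1):        # LSB first: n = N-1 down to 1
        acc = (acc + lut[address(xs, n, nbits)]) / 2    # scaling accumulator
    return acc - lut[address(xs, 0, nbits)]  # the sign-bit term carries weight -1

A = [0.5, -0.25, 0.75]                       # constants (stand-ins for twiddle factors)
xs = [3, -5, 7]                              # 4-bit words: 3/8, -5/8, 7/8
y = da_dot(build_lut(A), xs, nbits=4)
```

The loop body corresponds to the adder, register and scaler of the scaling accumulator: one table access and one addition per input bit, with no multiplier.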

2.4 DERIVATION PROCEDURE

The derivation of the DAA is simple, but its applications are extremely wide. The mathematics is a mix of Boolean and ordinary algebra and requires no prior preparation, even for the logic designer.

Consider,

a. Let x_k be an N-bit scaled two's complement number, that is, |x_k| < 1, with

x_k : {b_k0, b_k1, b_k2, ..., b_k(N−1)}, where b_k0 is the sign bit.

b. x_k can be expressed as,

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n}$$

Substituting this into the sum of products $y = \sum_{k=1}^{K} A_k x_k$ gives

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \right]$$

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (b_{kn} A_k)\, 2^{-n}$$

Expanding this part of the equation

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \left[ b_{k1} A_k 2^{-1} + b_{k2} A_k 2^{-2} + \cdots + b_{k(N-1)} A_k 2^{-(N-1)} \right]$$

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 2^{-1} + b_{12} A_1 2^{-2} + \cdots + b_{1(N-1)} A_1 2^{-(N-1)} \right] \\
& + \left[ b_{21} A_2 2^{-1} + b_{22} A_2 2^{-2} + \cdots + b_{2(N-1)} A_2 2^{-(N-1)} \right] \\
& + \cdots \\
& + \left[ b_{K1} A_K 2^{-1} + b_{K2} A_K 2^{-2} + \cdots + b_{K(N-1)} A_K 2^{-(N-1)} \right]
\end{aligned} \tag{2.2}$$

Rearranging and regrouping equation (2.2) yields the final equation

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 + b_{21} A_2 + \cdots + b_{K1} A_K \right] 2^{-1} \\
& + \left[ b_{12} A_1 + b_{22} A_2 + \cdots + b_{K2} A_K \right] 2^{-2} \\
& + \cdots \\
& + \left[ b_{1(N-1)} A_1 + b_{2(N-1)} A_2 + \cdots + b_{K(N-1)} A_K \right] 2^{-(N-1)}
\end{aligned}$$

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{n=1}^{N-1} \left[ b_{1n} A_1 + b_{2n} A_2 + \cdots + b_{Kn} A_K \right] 2^{-n}$$

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \tag{2.3}$$

DAA hides the explicit multiplication, and it can be seen that DA is a very efficient means to mechanize computations that are dominated by inner products.

2.5 BENEFITS OF DISTRIBUTED ARITHMETIC ALGORITHM

• The advantages of DAA are best exploited in data path circuit designs.

• The area saving from using DAA can be up to 80% and is seldom less than 50% in digital signal processing hardware designs.

• DAA is an old technique that has been revived by the widespread use of field programmable gate arrays (FPGAs) for digital signal processing (DSP).

• DAA efficiently implements the MAC operation using the basic building blocks (look-up tables) of FPGAs.

Though the distributed arithmetic algorithm has many attractive characteristics, its usage in DSP applications has been limited. With DAA, the dream of constructing a multiplier-less circuit becomes a reality. Moreover, greater speed benefits can be attained using DAA, especially for higher radix calculations. In DAA, multiplications are reordered and mixed such that the arithmetic becomes distributed through the structure rather than being lumped.

CHAPTER 3 DISTRIBUTED ARITHMETIC ALGORITHM BASED FFT PROCESSOR

3.1 OVERVIEW OF RADIX 4

The term radix refers to the size of the FFT decomposition. The butterfly diagram (fig 3.1) for the radix 4 algorithm consists of four inputs and four outputs. A radix 4 FFT needs half as many stages as a radix 2 FFT of the same length. For single radix FFTs, the transform size must be a power of the radix: the FFT length is 4^M, where M is the number of stages and 4 is the radix applied. The method proposed in this project is formulated using DIF logic. The main reason for opting for DIF instead of DIT is that the inputs need not be given in bit reversed order. Bit reversal is just what it sounds like: reversing the bits in a binary word from left to right, so that the MSBs become LSBs and the LSBs become MSBs.
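The reordering can be illustrated with two small Python helpers. The first reverses single bits as described above. The second reverses base-4 digits (pairs of bits), which is the ordering an in-place radix 4 flow usually produces; treating the "bit reversed" order of this project as base-4 digit reversal is an assumption made here, and both function names are illustrative.

```python
def bit_reverse(index, nbits):
    """Reverse the nbits-bit binary word of index (MSBs become LSBs)."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def digit_reverse_base4(index, ndigits):
    """Reverse the order of the base-4 digits (bit pairs) of index."""
    result = 0
    for _ in range(ndigits):
        result = (result << 2) | (index & 3)
        index >>= 2
    return result

# Output order of a 16-point (two-stage) radix 4 flow under the digit-reversal assumption
order16 = [digit_reverse_base4(i, 2) for i in range(16)]
```

For 16 points the digit-reversed order interleaves the indices with a stride of 4, which matches taking inputs "at an interval of 4" in one stage and in consecutive groups in the other.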

The inputs are passed along the stages, with computations involving the twiddle factors W_N^k, where k runs from 0 to N/4 − 1. Twiddle factors are the coefficients used to combine results from a previous stage to form inputs to the next stage.

3.1.1 NEED FOR RADIX 4

Implementation of the butterfly block in a DSP processor first requires selection of the radix. Several FFT algorithms have been proposed, such as radix 2, radix 4, radix 8 and other higher order radices. The designer has the freedom to choose the algorithm based on the application to be developed.

For applications where speed is not the critical factor, the designer can proceed with a radix 2 processor. A high speed processor, however, requires a higher order radix, because with higher radix algorithms the number of computations reduces and the speed improves. Blindly choosing a higher order radix nevertheless results in an inefficient processor: the internal complexity becomes so high that the design is difficult to implement and debugging takes a lot of time. Hence radix 4 is a sound choice, since it provides a good trade-off between the speed and the complexity of the design.

3.1.2 ALGORITHM

Complex numbers are used for the radix 4 FFT processor design. Each complex number consists of a real and an imaginary part. The real and imaginary parts of the inputs are processed separately and then combined to form the real and imaginary parts of the outputs. The twiddle factors are also given as complex numbers. After performing the complex multiplication and further simplification of equation (1.7), the real and imaginary parts are given by equations (3.1) to (3.8),

op1_re = ip1_re + ip2_re + ip3_re + ip4_re. (3.1)

op1_im = ip1_im + ip2_im + ip3_im + ip4_im. (3.2)

op2_re = [(ip1_re − ip3_re) + (ip2_im − ip4_im)] * tw1_re − [(ip1_im − ip3_im) − (ip2_re − ip4_re)] * tw1_im. (fig 3.2) (3.3)

op2_im = [(ip1_im − ip3_im) − (ip2_re − ip4_re)] * tw1_re + [(ip1_re − ip3_re) + (ip2_im − ip4_im)] * tw1_im. (fig 3.2) (3.4)

op3_re = [(ip1_re + ip3_re) − (ip2_re + ip4_re)] * tw2_re − [(ip1_im + ip3_im) − (ip2_im + ip4_im)] * tw2_im. (3.5)

op3_im = [(ip1_im + ip3_im) − (ip2_im + ip4_im)] * tw2_re + [(ip1_re + ip3_re) − (ip2_re + ip4_re)] * tw2_im. (3.6)

op4_re = [(ip1_re − ip3_re) − (ip2_im − ip4_im)] * tw3_re − [(ip1_im − ip3_im) + (ip2_re − ip4_re)] * tw3_im. (3.7)

op4_im = [(ip1_im − ip3_im) + (ip2_re − ip4_re)] * tw3_re + [(ip1_re − ip3_re) − (ip2_im − ip4_im)] * tw3_im. (3.8)

where ip and op are the input and output parameters respectively, tw = twiddle factor, re = real and im = imaginary.

Fig 3.2 Block diagram for a single output (without using DAA)

The other outputs are calculated in the same way. The first output (op1_re and op1_im) does not involve any computation with a twiddle factor, as W⁰ = 1.

If the radix 4 algorithm is implemented directly from equations (3.1) to (3.8), one step of the radix 4 algorithm requires more arithmetic operations than two steps of the radix 2 algorithm, because some partial results are computed more than once. In the processor, however, such partial results are identified and computed only once. One step of the radix 4 algorithm then requires fewer arithmetic operations than two steps of the radix 2 algorithm, and the total cost of the radix 4 algorithm can be lower than that of the radix 2 algorithm.
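The sharing of partial results can be sketched in Python on separate real and imaginary parts. This is an illustrative model of one radix 4 DIF butterfly, assuming the standard ±1, ±j weighting; each input and twiddle factor is a (re, im) tuple, and the function and variable names (`radix4_butterfly`, `s13`, `d24` and so on) are hypothetical, not taken from the project.

```python
def radix4_butterfly(ip1, ip2, ip3, ip4, tw1, tw2, tw3):
    """One radix 4 DIF butterfly on (re, im) tuples, with shared partial sums."""
    # Partial results, each computed only once
    s13 = (ip1[0] + ip3[0], ip1[1] + ip3[1])   # ip1 + ip3
    d13 = (ip1[0] - ip3[0], ip1[1] - ip3[1])   # ip1 - ip3
    s24 = (ip2[0] + ip4[0], ip2[1] + ip4[1])   # ip2 + ip4
    d24 = (ip2[0] - ip4[0], ip2[1] - ip4[1])   # ip2 - ip4

    def twiddle(re, im, tw):
        # Complex multiplication: 4 real multiplications, 2 real additions
        return (re * tw[0] - im * tw[1], im * tw[0] + re * tw[1])

    op1 = (s13[0] + s24[0], s13[1] + s24[1])                    # no twiddle, W^0 = 1
    op2 = twiddle(d13[0] + d24[1], d13[1] - d24[0], tw1)        # (ip1-ip3) - j(ip2-ip4)
    op3 = twiddle(s13[0] - s24[0], s13[1] - s24[1], tw2)        # (ip1+ip3) - (ip2+ip4)
    op4 = twiddle(d13[0] - d24[1], d13[1] + d24[0], tw3)        # (ip1-ip3) + j(ip2-ip4)
    return op1, op2, op3, op4

outputs = radix4_butterfly((1, 2), (3, -1), (0, 4), (-2, 5),
                           (0.6, -0.8), (0.0, -1.0), (-0.8, -0.6))
```

Counting the operations in this sketch gives 12 real multiplications and 22 real additions: 16 additions for the four sums and differences and their combinations, plus 6 inside the three twiddle multiplications.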

3.1.3 COMPUTATIONAL COST

For a radix 4 butterfly, there are 12 real multiplications and 22 real additions. This is equivalent to 3 complex multiplications and 8 complex additions, since one complex multiplication requires four real multiplications plus two real additions, and one complex addition requires two real additions:

(a+bj) × (c+dj) = (a×c − b×d) (real part: 2 real multiplications, 1 real addition)
+ (a×d + c×b) j (imaginary part: 2 real multiplications, 1 real addition)

(a+bj) + (c+dj) = (a+c) (real part: 1 real addition)
+ (b+d) j (imaginary part: 1 real addition)

As mentioned before, in the radix 4 DIF butterfly algorithm, the N-point FFT consists of log₄N stages, and each stage consists of N/4 radix 4 DIF butterflies.

The radix 4 DIF butterfly calculation reduces the number of complex multiplications needed for an N-point FFT from N² to 3(N/4) log₄N (from 4N² to 3N log₄N in terms of real multiplications). For example, the number of complex multiplications needed for a 16-point FFT is reduced from 256 to 24, and for a 64-point FFT from 4096 to 144. For the 64-point case, the improvement of the radix 4 DIF-FFT butterfly algorithm over direct calculation of the DFT is therefore approximately 28.4 times.
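The multiplication counts follow directly from the two formulas, as the short Python sketch below shows. The function names are illustrative, and the count of 3(N/4) log₄N treats every butterfly as needing three twiddle multiplications, including the trivial ones.

```python
def radix4_stages(N):
    """Number of radix 4 stages, log4(N); N must be a power of 4."""
    stages = 0
    while N > 1:
        if N % 4:
            raise ValueError("N must be a power of 4")
        N //= 4
        stages += 1
    return stages

def direct_complex_mults(N):
    return N * N

def radix4_complex_mults(N):
    return 3 * (N // 4) * radix4_stages(N)

for points in (16, 64):
    direct = direct_complex_mults(points)
    fast = radix4_complex_mults(points)
    print(points, direct, fast, round(direct / fast, 1))
```

Multiplying either count by 4 gives the corresponding number of real multiplications.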

3.2 BINARY SCALING TECHNIQUE

The binary scaling technique was first used in the 1970s and 80s for real time computing. Many mathematically intensive applications were commented with the binary scaling of their intermediate results. It is faster than directly using floating point instructions, and for a given word length it can also be more accurate, since no bits are spent on an exponent. In the FFT processor, this technique is used for the twiddle factors, which are floating point values with real and imaginary parts. A common way to use integer arithmetic to simulate floating point is to multiply the coefficients by 256₁₀. In binary scientific notation, this places the binary point at 100₁₆. For instance,

0.707 = 181 and 0.382 = 98 (at 100₁₆).

Multiplying them gives

181 * 98 = 17738.

To represent the product back in 100₁₆ format, it is divided by 2⁸ (shifted right by 8 bits): 17738/2⁸ = 69. Converting back to floating point gives 69/256 = 0.270 (0.707 * 0.382 = 0.270).

The procedure above is the basic operation of the binary scaling technique. For the FFT processor, 100000000₂ is used as the binary point for the twiddle factors. The number is represented back in 100000000₂ format by shifting it by 8 bits. The result is not exact; an approximation is applied.

The approximation is carried out by rounding the numbers down to the lower limit. For example, 0.45 and −0.45 are rounded to 0 and −1 respectively. In this way the floating point twiddle factors are converted into integers, the arithmetic operations are performed on them, and the result is converted back to the output scale by shifting.
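The binary scaling steps can be reproduced with integer arithmetic in Python. The sketch assumes a scale factor of 2⁸ = 256, coefficients rounded to the nearest integer when converted (which gives 181 and 98), and an arithmetic right shift for rescaling, which rounds towards the lower limit. The names `SHIFT`, `to_fixed`, `fixed_mul` and `to_float` are illustrative.

```python
SHIFT = 8                          # binary point at 100000000 (base 2), i.e. a scale of 256

def to_fixed(value):
    """Scale a floating point coefficient to an integer (nearest integer)."""
    return round(value * (1 << SHIFT))

def fixed_mul(a, b):
    """Multiply two scaled integers and shift back; >> rounds towards the lower limit."""
    return (a * b) >> SHIFT

def to_float(fixed):
    return fixed / (1 << SHIFT)

a = to_fixed(0.707)                # 181
b = to_fixed(0.382)                # 98
product = fixed_mul(a, b)          # 17738 >> 8
approx = to_float(product)
```

Note that shifting 17738 right by 8 bits gives 69, so the scaled product corresponds to 69/256, about 0.270. For negative values the same shift rounds downwards, for example −17738 >> 8 gives −70.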

3.3 DESIGN OF THE PROCESSOR

A radix 4 FFT using DAA with binary scaling technique in DIF format is designed. The inputs (both real and imaginary parts) have to undergo arithmetic operations like addition and subtraction for the FFT outputs (fig 3.3).

3.3.1 16 POINT RADIX 4 PRELIMINARIES

A 16 point FFT processor takes 16 inputs, each with a 4-bit real part and a 4-bit imaginary part. The inputs can vary from 0000 to 1111 (F). The number of stages of the processor is given by the formula log₄N, where N is the number of points. For a radix 4 16 point FFT, 2 stages are required.

3.3.1.1 DESIGN METHODOLOGY

The complete FFT is carried out with parallel processing. All the inputs are applied in a single clock cycle, and the processed outputs are also generated within a single clock cycle. This is done by using synchronized registers at the output end. There is a clock delay for processing the output from the look-up table. The inputs to stage1 are 4 bits wide. Each twiddle factor is 10 bits wide, represented in 100000000₂ format. After processing with the twiddle factors in stage1, a 13-bit intermediate DAA-LUT result is obtained. After shifting by 8 bits, the final stage1 intermediate output is 6 bits wide (5 bits plus 1 for overflow). The intermediate output is given to stage2, where DAA is not necessary since the twiddle factors all correspond to W⁰ = 1. Hence only additions and subtractions are performed, producing an output of 10 bits (9 bits plus 1 for overflow).

The inputs to a single stage1 radix 4 block are taken at an interval of 4 (ip1_re, ip1_im; ip5_re, ip5_im; ip9_re, ip9_im; ip13_re, ip13_im). Four radix 4 blocks are used in stage1. The intermediate result of stage1 is denoted y. For the next stage, the inputs are taken in consecutive groups of 4 (y1_re, y1_im; y2_re, y2_im; y3_re, y3_im; y4_re, y4_im) to get bit reversed outputs at the end of stage2.

3.3.1.2 BLOCK DIAGRAM OF 16 POINT RADIX 4 FFT PROCESSOR

The FFT process is summarized as follows:

1. First stage: The input data are loaded in the normal addressing mode. The radix 4 DIF butterfly calculation is repeated N/4 times on the input data. In the first stage, the twiddle factors in the first block are equal to 1. The twiddle factors in the other blocks are (W_16^k, W_16^2k, W_16^3k) with k = 1, 2, 3.

2. Last stage: The radix 4 DIF butterfly calculation is again repeated N/4 times. The twiddle factors are all equal to 1. As a result, there is no multiplication in the last stage, just as in the first block of the first stage.

Fig 3.4 Block diagram for 16 point radix 4 FFT processor

Thus the complete block diagram of radix 4 (fig 3.4) with 16 inputs and 16 outputs consists of two stages with DAA being used for computation with twiddle factors.
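The two-stage data flow can be modelled in floating point Python to check the block wiring. This is a behavioural sketch only: it uses complex arithmetic instead of DAA and binary scaling, assumes W_16 = exp(−j2π/16), stores results in place, and uses the hypothetical names `butterfly4` and `fft16_radix4_dif`.

```python
import cmath

def butterfly4(a, b, c, d):
    """Radix 4 butterfly without twiddle factors."""
    return (a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d)

def fft16_radix4_dif(x):
    """16-point radix 4 DIF FFT: normal-order inputs, reversed-digit-order outputs."""
    assert len(x) == 16
    y = [0j] * 16
    for block in range(4):                       # stage 1: inputs taken 4 apart
        outs = butterfly4(x[block], x[block + 4], x[block + 8], x[block + 12])
        for r in range(4):
            # Block 0 has all twiddles equal to 1; block k uses W16^k, W16^2k, W16^3k
            y[block + 4 * r] = outs[r] * cmath.exp(-2j * cmath.pi * r * block / 16)
    z = [0j] * 16
    for group in range(4):                       # stage 2: consecutive groups, no twiddles
        z[4 * group:4 * group + 4] = butterfly4(*y[4 * group:4 * group + 4])
    return z                                     # z[4*r + k] holds X(4*k + r)

out = fft16_radix4_dif([complex(n % 5, (-n) % 3) for n in range(16)])
```

Reading the result with the two base-4 digits of the index swapped restores the natural order of X(k).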

3.3.2 64 POINT RADIX 4 PRELIMINARIES

A 64 point radix 4 FFT processor consists of log₄64 = 3 stages (log_r N, where r and N represent the radix and the number of points respectively). The inputs can take 64 combinations ranging from 000000 to 111111. The complexity of a 64 point design is high compared with 16 point, but practical applications today require high data rates. Thus higher point processors are employed everywhere, even with a higher degree of complexity. This design complexity can be reduced to a certain extent by reusing the blocks that have similar twiddle factors.

3.3.2.1 DESIGN METHODOLOGY

As in the design flow for 16 points, the inputs to stage1 are 4 bits wide. Each twiddle factor is 10 bits wide, represented using the binary scaling technique in 100000000₂ format. After processing with the twiddle factors in stage1, the intermediate DAA-LUT result is 13 bits wide. After shifting the intermediate value by 8 bits, the final stage1 output is 6 bits wide (5 bits plus 1 for overflow). The stage1 output is given to stage2, which also employs DAA and produces an output of 8 bits. For the last stage, DAA is not necessary since the twiddle factors all correspond to W⁰. Hence only additions and subtractions are performed in the final stage, producing an output of 11 bits (10 bits plus 1 for overflow).

The inputs to a single stage1 radix 4 block are taken at an interval of 16 (ip1_re, ip1_im; ip17_re, ip17_im; ip33_re, ip33_im; ip49_re, ip49_im). Sixteen radix 4 blocks are used in stage1.

The intermediate result of stage1 is denoted y. For the second stage, the inputs are taken at an interval of 4 (y1_re, y1_im; y5_re, y5_im; y9_re, y9_im; y13_re, y13_im), giving the intermediate result x. For the last stage, the inputs are taken in consecutive groups of 4 (x1_re, x1_im; x2_re, x2_im; x3_re, x3_im; x4_re, x4_im) to get bit reversed outputs at the end of stage3.

3.3.2.2 BLOCK DIAGRAM OF 64 POINT RADIX 4 FFT PROCESSOR

Fig 3.5 Block diagram of 64 point radix 4 FFT processor

The FFT for 64 point is summarized as follows:

1. First stage: The input data are loaded in the normal addressing mode. The radix 4 DIF butterfly calculation is repeated N/4 times (16 times) on the input data. In the first stage, the twiddle factors in the first block are equal to 1. The twiddle factors in the other blocks are (W_64^k, W_64^2k, W_64^3k) with k = 1, 2, 3, ..., 15.

2. Second stage: The previous stage outputs are taken at an interval of 4 into a single radix 4 block. As in the first stage, different blocks require different twiddle factors (W_64^k, W_64^2k, W_64^3k) with k = 4, 8, 12.

3. Last stage: The radix 4 DIF butterfly calculation is repeated N/4 times. The twiddle factors are all equal to 1, making DAA unnecessary for this stage. As a result, there is no multiplication, as in the first block of the first stage. The result is generated by simple additions and subtractions on the outputs of the previous stage.

Thus the complete block diagram and flow of inputs from one stage to another is given in fig 3.5, which consists of 3 stages with normal inputs and bit reversed outputs.
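The three-stage flow generalizes the 16 point case, and it can be modelled with one in-place loop in Python. As before this is a floating point behavioural sketch under the assumption W_N = exp(−j2π/N), not the DAA hardware; the function names are illustrative. For N = 64 the loop takes inputs 16 apart, then 4 apart, then in consecutive groups, and the second-stage twiddle factors W_16^n equal W_64^4n, that is k = 4, 8, 12.

```python
import cmath

def fft_radix4_dif_inplace(x):
    """In-place radix 4 DIF FFT for len(x) a power of 4; output in reversed base-4 digit order."""
    N = len(x)
    data = list(x)
    span = N                                   # length of the sub-transform in this stage
    while span >= 4:
        q = span // 4                          # distance between the four butterfly inputs
        for start in range(0, N, span):
            for n in range(q):
                i = start + n
                a, b, c, d = data[i], data[i + q], data[i + 2 * q], data[i + 3 * q]
                w = cmath.exp(-2j * cmath.pi * n / span)
                data[i] = a + b + c + d
                data[i + q] = (a - 1j * b - c + 1j * d) * w
                data[i + 2 * q] = (a - b + c - d) * w ** 2
                data[i + 3 * q] = (a + 1j * b - c - 1j * d) * w ** 3
        span = q
    return data

def digit_reverse_base4(index, ndigits):
    result = 0
    for _ in range(ndigits):
        result = (result << 2) | (index & 3)
        index >>= 2
    return result

raw = fft_radix4_dif_inplace([complex(n % 7, n % 4) for n in range(64)])
natural = [raw[digit_reverse_base4(k, 3)] for k in range(64)]   # X(0), X(1), ..., X(63)
```

In the last stage span equals 4, so n is always 0 and every twiddle factor is 1, which is why that stage needs no multiplication.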

CHAPTER 4 SIMULATION RESULTS

4.1 RADIX 4 16 POINT FFT PROCESSOR

The radix 4 16 point FFT processor is designed and simulated successfully using Altera Quartus II 9.1 software. The various results that are drawn out from the simulation are detailed below.

4.1.1 COMPILATION REPORT

Fig 4.1 Compilation report for radix 4 16 point FFT processor

The compilation report (fig 4.1) lists the logic elements, combinational functions and total registers used. Combinational logic circuits implement Boolean functions, so their outputs are functions of their inputs only and do not depend on a clock. Sequential logic circuits have clock inputs in addition: they compute their outputs from the inputs and a state, and the state is updated on a clock. The compilation report shows that the radix 4 16 point processor requires 1330 combinational elements and 508 sequential elements out of the 33,216 available.

4.1.2 REGISTER TRANSFER LEVEL VIEWER

Fig 4.2 Register transfer level of radix 4 16 point FFT processor

The radix 4 16 point RTL is shown in fig 4.2. The 16 point radix 4 design has 2 stages. The RTL viewer provides a hierarchy list that represents the project hierarchy and a schematic view that displays the components of the design element being examined.

4.1.3 TIMING ANALYZER

Fig 4.3 Timing analyzer report for radix 4 16 point FFT processor

The report uses several terms. Setup time (tsu) is the interval just before the clock edge during which the data signal must not change. Hold time (th) is the interval just after the clock edge during which the data signal must remain stable. Clock to output delay (tco) is the time between the clock edge and the appearance of the output. The clock setup entry gives the operating frequency of the designed FFT processor. All of these values are given in the timing analyzer report (fig 4.3).

4.1.4 POWERPLAY ANALYZER

The PowerPlay power analyzer (fig 4.4) reports the total, dynamic, static and I/O thermal power dissipation.

Fig 4.4 Powerplay power analyzer summary for radix 4 16 point FFT processor

Static power is the power consumed while there is no circuit activity. Dynamic power is the power consumed while the inputs are active: when the inputs toggle, capacitances charge and discharge, and the power increases as a result. The total thermal power dissipation is the sum of the dynamic, static and I/O thermal power dissipation, which is 161.23 mW here.

4.1.5 CHIP PLANNER - FAN IN AND FAN OUT

The chip planner (fig 4.5) provides a visual display of chip resources. It shows logic placement, LogicLock regions, relative resource usage, detailed routing information, fan-ins and fan-outs, paths between registers and high-speed transceiver channels, among other things.

Fig 4.5 Chip planner – fan in and fan out radix 4 16 point FFT processor

Fan-in is the number of inputs a gate can handle. Fan-out is the maximum number of digital inputs that the output of a single logic gate can feed. The fan-in view covers 4209 nodes with 13265 connections, and the fan-out view covers 4209 nodes with 10829 connections.

4.1.6 SIMULATION WAVEFORM

Table 1 Sample I/O for 16 point FFT processor

| No. | INPUT (IP) | OUTPUT (OP) |
|-----|------------|-------------|
| 1 | 1+1j | 20+19j |
| 2 | 2+1j | 6+1j |
| 3 | 2+2j | 6+13j |
| 4 | 2+0j | 0+7j |
| 5 | 2+3j | 3-6j |
| 6 | 1+1j | 3-2j |
| 7 | 1+1j | 1+0j |
| 8 | 0+0j | 1+0j |
| 9 | 1+5j | -3-2j |
| 10 | 1+1j | -5+0j |
| 11 | 1+2j | -1+4j |
| 12 | 0+0j | -7+6j |
| 13 | 4+1j | -5-7j |
| 14 | 1+0j | -1-7j |
| 15 | 1+1j | -1-7j |
| 16 | 0+0j | -1-3j |
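The first four outputs of Table 1 can be cross-checked against a direct DFT. The sketch below assumes that the table lists the outputs in base-4 digit reversed order (in line with the bit reversed output ordering of the DIF structure) and that the forward transform uses the negative exponent of equation (1.1); the helper names are hypothetical. Only OP1 to OP4 are checked, because they correspond to X[0], X[4], X[8] and X[12], whose twiddle factors are ±1 and ±j. The remaining outputs depend on the fixed-point scaling of the hardware and are not asserted here.

```python
import cmath

# Inputs IP1..IP16 copied from Table 1
inputs = [1+1j, 2+1j, 2+2j, 2+0j, 2+3j, 1+1j, 1+1j, 0+0j,
          1+5j, 1+1j, 1+2j, 0+0j, 4+1j, 1+0j, 1+1j, 0+0j]


def dft(samples):
    """Direct N-point DFT with the negative exponent convention."""
    n = len(samples)
    return [sum(samples[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def digit_reverse_base4(index, digits):
    """Reverse the base-4 digits of index."""
    reversed_index = 0
    for _ in range(digits):
        reversed_index = reversed_index * 4 + index % 4
        index //= 4
    return reversed_index


spectrum = dft(inputs)
# Output position p is assumed to hold X[digit_reverse_base4(p, 2)]
reordered = [spectrum[digit_reverse_base4(p, 2)] for p in range(16)]
first_four = [complex(round(v.real), round(v.imag)) for v in reordered[:4]]
print(first_four)  # matches OP1..OP4: 20+19j, 6+1j, 6+13j, 0+7j
```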

The output waveform for radix 4 is given in fig 4.6. The input samples are held for one clock period and the outputs are generated at the second positive clock edge because of the processing delay. The inputs are supplied through a vector waveform file. The waveform file is loaded into the simulator tool and the functional simulation netlist is generated for the input.

Fig 4.6 Simulation waveform for radix 4 16 point FFT processor

4.2 RADIX 4 64 POINT FFT PROCESSOR

Like the 16 point processor, the 64 point processor is designed and simulated. The corresponding results are given in the following sections.

4.2.1 COMPILATION REPORT

Fig 4.7 shows that the total number of combinational and logic functions employed increases for the 64 point design (approximately doubling), and the number of pins required rises from 449 to 463.

Fig 4.7 Compilation report for radix 4 64 point FFT processor

4.2.2 REGISTER TRANSFER LEVEL VIEWER

The radix 4 64 point butterfly diagram is shown in fig 4.8. The main difference between the 16 point and 64 point designs is that the 64 point design has three stages whereas the 16 point design has only two.

In fig 4.8, the last two stages are combined so that a single 16 point radix 4 block can be used in place of two stages of 4 point radix 4 blocks. The two stage computation is thus replaced by a single block. Although merging the units does not alter the number of elements, it reduces the delay significantly.

4.2.3 TIMING ANALYZER

The timing analyzer gives the worst case setup time, hold time and clock to output delay, together with the clock setup figure. The timing analyzer report for the 64 point design is given in fig 4.9.

4.2.4 POWERPLAY ANALYZER

Fig 4.10 Powerplay power analyzer summary for radix 4 64 point FFT processor

The powerplay power analyzer report is similar to that of the radix 4 16 point design. The powerplay power analyzer accepts information from a source and analyzes it together with several factors affecting the power consumption to produce a high quality power estimate. It produces a power consumption profile representative of the expected design utilization after simulation. The total thermal power dissipation is 163.26 mW and the I/O thermal power dissipation is 83.16 mW.

4.2.5 CHIP PLANNER - FAN IN AND FAN OUT

The chip planner view of fan in and fan out is given in fig 4.11. The fan-in view covers 4209 nodes with 13265 connections, and the fan-out view covers 4209 nodes with 10829 connections.

Fig 4.11 Chip planner - fan in and fan out for radix 4 64 point FFT processor

4.2.6 SIMULATION WAVEFORM

Fig 4.13 Simulated waveform output for 64 point radix 4 FFT processor (2)

The simulated waveform for the 64 point design, showing the real part of the inputs and outputs, is given in figs 4.12 and 4.13.

4.3 EP2C35F672C6 DE2 BOARD

Both the 16 point and 64 point designs are implemented on the Altera EP2C35F672C6 DE2 board.

Table 2 Device specifications

| Parameter | Value |
|-----------|-------|
| Device | Cyclone II EP2C35F672C6 |
| Core voltage | 1.2 V |
| Logical elements | 33216 |
| User I/Os | 475 |
| Logical registers | 33216 |
| Package | FBGA |
| Pin count | 672 |
| Speed grade | 6 |

Following the first-generation Cyclone device family, the Altera Cyclone II EP2C35F672C6 used here provides 33,216 logic elements (LE), up to 475 usable I/O pins in a 672-pin Fine-Line Ball Grid Array (FBGA) package, and 0.48 Mbits of embedded memory (table 2).

Cyclone II FPGAs are manufactured on 300-mm wafers using Taiwan Semiconductor Manufacturing Company’s (TSMC) 90-nm low-k dielectric process to ensure rapid availability at low cost. By minimizing the silicon area, Cyclone II devices can support complex digital systems on a single chip at a cost that rivals that of ASICs. According to Altera, Cyclone II FPGAs offer 60% higher performance and half the power consumption of competing 90-nm FPGAs. The low cost and optimized feature set of Cyclone II FPGAs make them well suited to a wide array of automotive, consumer, communications, video processing, test and measurement, and other end-market solutions.

4.3.1 BOARD IMPLEMENTATION

The radix 4 16 point and 64 point designs are implemented on the board. The pin assignments are made using the pin planner tool. After the I/O pin assignments are verified, the code is compiled to produce the .sof file. The SRAM Object File (.sof) is a binary file generated by the Compiler's assembler module or by the makeprogfile command-line utility. It contains the data for configuring all SRAM-based Altera devices supported by the Quartus II software, using the Programmer. The Programmer is then started and the .sof file is downloaded to the board via the USB blaster. When the process is complete, inputs are applied through the board and the corresponding outputs are generated.

Fig 4.14 Board implementation of radix 4 16 point and 64 point FFT processor

The switches are used for the inputs and the clock signal. The green and red LEDs are used to display the outputs (fig 4.14). The outputs are produced at the positive edge of the clock. The parallel processing radix 4 FFT processor introduces a delay of one clock cycle for the 16 point design and two clock cycles for the 64 point design.

4.4 RESULT ANALYSIS

After simulation, the outcomes of the different processors are tabulated (table 3). The tabulation shows a significant decrease in area usage for the radix 4 16 point design compared with the radix 2 16 point design. As already discussed, the computational cost is greatly reduced for radix 4, which is its advantage over radix 2. For radix 4, different FFT processors are built with various algorithms: complex adders and multipliers, the modified Booth multiplier and the distributed arithmetic algorithm (DAA). Among these, DAA has the lowest area usage in terms of both combinational and sequential elements.

Table 3 Consolidated results of various FFT processors

| Parameter | Radix 2, complex adders & multiplier, 16 point | Radix 4, complex adders & multiplier, 16 point | Radix 4, Booth modified algorithm, 16 point | Radix 4, distributed arithmetic algorithm, 16 point | Radix 4, distributed arithmetic algorithm, 64 point |
|---|---|---|---|---|---|
| Operator (add) | 192 | 184 | 520 | 177 | 262 |
| Operator (mul) | 72 | 128 | - | - | - |
| Pin (475) | 449 (94.52%) | 449 (94.52%) | 449 (94.52%) | 449 (94.52%) | 463 (97.47%) |
| Logical register (33216) | 1040 (3.13%) | 553 (1.66%) | 559 (1.68%) | 507 (1.52%) | 770 (2.31%) |
| Combinational components (33216) | 2044 (6.15%) | 1910 (5.75%) | 2346 (7.06%) | 1333 (3.98%) | 2778 (8.36%) |
| Worst-case tsu (ns) | 12.251 | 15.767 | 17.235 | 12.982 | 10.516 |
| Worst-case tco (ns) | 10.287 | 10.735 | 10.687 | 10.690 | 8.802 |
| Worst-case th (ns) | -0.156 | -1.279 | -2.278 | -1.737 | -1.570 |
| Clock setup frequency (MHz) | 139.68 | 202.88 | 206.23 | 222.6 | 182.52 |
| Time period (ns) | 7.159 | 4.929 | 4.849 | 5.872 | 5.478 |
| Throughput (b/s) | 4.4 × 10^10 | 6.4 × 10^10 | 6.5 × 10^10 | 6.8 × 10^10 | 4.6 × 10^10 |

The efficiency of an algorithm can best be estimated from its throughput, which is the number of output bits produced per second.

Throughput = clock frequency × number of outputs per clock cycle × number of bits per output (parallel processing)
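The throughput relation can be evaluated with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the figures of 16 outputs per clock cycle and 20 bits per output are assumptions made for the example, since the output word width is not stated here. Only the 139.68 MHz clock frequency is taken from table 3.

```python
def throughput_bps(clock_hz, outputs_per_cycle, bits_per_output):
    """Throughput in bits per second for a fully parallel processor."""
    return clock_hz * outputs_per_cycle * bits_per_output


# 139.68 MHz from table 3; 16 outputs per cycle and 20 bits per output are assumed
example = throughput_bps(139.68e6, 16, 20)
print(f"{example:.3e} b/s")
```

With these assumed values the result is about 4.47 × 10^10 b/s, the same order as the radix 2 16 point figure reported in the results.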

To give a clearer view of the efficiency of the radix 4 16 point processor using DAA, the results are plotted.

Fig 4.15 Result analysis plot

Fig 4.15 shows the following: combinational elements, 2044 (radix 2 16 point) > 1333 (radix 4 16 point DA); sequential elements, 1040 (radix 2 16 point) > 507 (radix 4 16 point DA); clock frequency, 139.68 MHz (radix 2 16 point) < 222.6 MHz (radix 4 16 point DA); throughput, 4.46 × 10^10 (radix 2 16 point) < 6.641 × 10^10 (radix 4 16 point DA). Thus for the radix 4 DA FFT processor the overall element requirement is reduced, since both the combinational and the sequential elements decrease. The speed increases with the clock frequency, which in turn gives a greater throughput than the other algorithms performing the same FFT function.

[Fig 4.15 plot, axis scale 0 to 1000. Categories: combinational elements, sequential elements, clock frequency (MHz), throughput (×10^10 b/s). Series: radix 4 distributed arithmetic algorithm 64 point, radix 4 distributed arithmetic algorithm 16 point, radix 4 Booth multiplier algorithm 16 point, radix 4 complex adders and multipliers 16 point, radix 2 complex adders and multipliers 16 point]

CHAPTER 5 CONCLUSION AND FUTURE SCOPE

In this project, a novel area efficient radix 4 FFT parallel processor for 16 and 64 points has been developed. The processor uses a distributed arithmetic algorithm (DAA) based look-up table (LUT) to obtain a multiplier-less design. The floating point computation of the twiddle factors is made easier with the help of a binary scaling technique. The FFT processor is designed, simulated and implemented using the Altera EP2C35F672C6 FPGA device. The proposed method is compared with several alternative designs. The results show that the proposed method uses fewer combinational and sequential elements and achieves a higher clock frequency and throughput than the 16 point designs based on complex adders and multipliers.

Future research work shall include implementing higher radix FFT processors with the proposed DAA based architecture. For instance, the same algorithm can be extended to a higher radix with more data points, such as 512, 1024, 4096 and 8192, for orthogonal frequency division multiplexing (OFDM) real time applications. Similarly, it can be extended to current technologies such as orthogonal frequency-division multiple access (OFDMA) and single carrier frequency division multiple access (SC-FDMA) for long term evolution (LTE) and worldwide interoperability for microwave access (WiMAX) applications. The implemented algorithm gives an easy way to increase the number of FFT points with only simple modifications.
