ISSN 2322-0929 Vol.04, Issue.10, October-2016, Pages:0931-0936

www.ijvdcs.org

VLSI Computational Architectures for the Arithmetic Cosine Transform

D. DIVYASRI¹, N. ASHOK KUMAR², G. CHANDRASHEKAR REDDY³

¹PG Scholar, Dept. of ECE (VLSI), Avanthi’s Scientific Technological & Research Academy, Hyderabad, TS, India, E-mail: [email protected].

²Assoc. Prof. & HOD, Dept. of ECE, Avanthi’s Scientific Technological & Research Academy, Hyderabad, TS, India, E-mail: [email protected].

³Assistant Professor, Dept of VLSI, Avanthi’s Scientific Technological & Research Academy Hyderabad, TS, India.

Abstract: This paper presents the Arithmetic Cosine Transform (ACT), a fast algorithm for evaluating the Discrete Cosine Transform (DCT) in digital signal processing. Common algorithms for computing the exact DCT rely on floating-point operations dominated by multiplications, which introduce round-off error. The ACT is proposed for rapid calculation of the DCT using only additions and constant multiplications, which reduces internal errors such as round-off and truncation. For zero-mean input signals, the ACT yields low area and low power consumption. For non-null mean signals, a novel architecture computes the ACT with reasonable accuracy while still reducing area and power. Computing the eight-point DCT with the ACT requires ten non-uniformly sampled input points. The designs are coded in Verilog HDL and implemented with Xilinx ISE Design Suite 10.1. Both architectures are simulated and synthesized using Cadence Encounter, and the physical design is obtained.

Keywords: Discrete Cosine Transform, Arithmetic Cosine Transform, Fast Algorithms, VLSI.

I. INTRODUCTION

A discrete cosine transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high-frequency components can be discarded) to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, cosine functions are much more efficient, whereas for differential equations the cosines express a particular choice of boundary conditions. Like the discrete Fourier transform (DFT) and other Fourier-related transforms, the DCT expresses a function or signal as a sum of sinusoids with different frequencies and amplitudes, and it operates on a function at a finite number of discrete data points. The visible difference, that the DCT uses only cosines, is merely a consequence of a deeper distinction: a DCT implies different boundary conditions than the DFT or other related transforms. Frequency analysis of discrete-time signals is convenient with the DCT. The DCT is the most popular transform technique for image compression and has been adopted in various standardized coding schemes. Some applications require real-time manipulation of digital images.

Because of this, fast algorithms and dedicated circuits for the DCT have been developed. Among the methods for the two-dimensional DCT, the indirect method based on row-column decomposition is the best suited to hardware implementation.

The energy compaction property of the DCT is well suited to image compression because, in most images, the energy is concentrated in the low to middle frequencies, and the human eye is more sensitive to the middle frequencies. Most useful image content changes relatively slowly across an image; it is unusual for intensity values to swing up and down several times in a small area such as an 8 x 8 image block. In the spatial frequency domain, this means that lower spatial frequency components generally contain more information than high-frequency components, which often correspond to less useful detail and noise. The DCT transforms data into a form that can be easily compressed, which makes it well suited to image compression algorithms. These algorithms minimize the amount of data needed to recreate a digitized image. Reducing a digitized image to the least possible data has several advantages: less memory is required to store images, less time may be needed to analyze them, and channel bandwidth is used more efficiently when transmitting them. Performing the DCT on a digitized image creates a data array that can be compressed by data compaction algorithms. The data can then be stored or transmitted in its compacted form. The image quality depends on the amount of quantization used in the compaction algorithm. To reproduce the original image, the data is retrieved from memory, uncompacted, and an inverse DCT is performed.
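
The transform, quantize, and inverse-transform chain described above can be illustrated with a minimal sketch. It assumes the orthonormal DCT-II on a single 8-sample row; the function names, the sample values, and the quantization step of 4 are illustrative choices and are not taken from the paper.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a sequence x (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(acc)
    return out

def quantize(coeffs, step):
    """Uniform quantization: round each coefficient to a multiple of step."""
    return [step * round(c / step) for c in coeffs]

# A slowly varying 8-sample row, typical of smooth image content (made-up values).
row = [52.0, 55.0, 61.0, 66.0, 70.0, 73.0, 75.0, 76.0]
coeffs = dct_1d(row)

# Energy compaction: share of the total energy held by the two lowest coefficients.
total_energy = sum(c * c for c in coeffs)
energy_fraction = sum(c * c for c in coeffs[:2]) / total_energy

# Lossy round trip: quantize the coefficients, then apply the inverse DCT.
reconstructed = idct_1d(quantize(coeffs, step=4.0))
max_error = max(abs(a - b) for a, b in zip(row, reconstructed))
```

For this smooth row, `energy_fraction` exceeds 0.99, which is the compaction effect in miniature. A larger `step` discards more information and increases `max_error`.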

International Journal of VLSI System Design and Communication Systems

Some of today's most popular image data compression applications include teleconferencing using motion-compensated video codecs; ISDN multimedia communications including voice, video, text, and images; video channel transmission using commercial geosynchronous telecommunications satellites; and digital facsimile transmission using dedicated equipment and personal computers. Several image data compression algorithms use the DCT to remove spatial data redundancies in two-dimensional (2D) data. Images are subdivided into smaller, two-dimensional blocks, which are then processed independently of the neighboring blocks. In general, the two-dimensional discrete cosine transform (2D DCT) transforms an (n x n) data array into an (n x n) result array. First the DCT transforms the columns, and then it transforms the rows.
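
The column-then-row order can be sketched in a few lines. This is a software illustration only, assuming the orthonormal DCT-II; the helper names `dct_1d`, `transpose`, and `dct_2d` are hypothetical and do not describe the hardware.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a sequence x (direct evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def dct_2d(block):
    """2D DCT by row-column decomposition: 1D DCT on columns, then on rows."""
    columns_done = transpose([dct_1d(col) for col in transpose(block)])
    return [dct_1d(row) for row in columns_done]

# A flat 8 x 8 block: all of its energy should land in the DC coefficient.
flat_block = [[10.0] * 8 for _ in range(8)]
result = dct_2d(flat_block)
```

Because the 2D transform is separable, the same 1D unit can be reused for both passes, which is why this decomposition maps well to hardware.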

II. PROPOSED METHOD

A. DCT using Distributed Arithmetic

Distributed arithmetic (DA) is a bit-level rearrangement of a multiply-accumulate operation that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate unit and is well suited to FPGA designs. It can also be extended to other sum-of-products functions such as complex multiplications and Fourier transforms. DA is an effective method for computing inner products: it uses look-up tables (LUTs) and accumulators instead of multipliers. DA is widely applied in Very Large Scale Integration (VLSI) implementations of Digital Signal Processing (DSP) algorithms. Most of these applications, for example Discrete Cosine Transform (DCT) calculation, are arithmetic intensive, with multiply-accumulate (MAC) being the predominant operation. The advantage of the DA approach is that it alters the basic assumption of using multipliers and adders for computing the DCT.
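
A minimal software model of the LUT-and-accumulator idea is given below. It assumes integer coefficients and two's complement inputs; a real DCT design would use fixed-point cosine constants. The names `build_lut` and `da_inner_product` are illustrative and do not come from the paper.

```python
def build_lut(coeffs):
    """Precompute every partial sum of coeffs.

    Entry `addr` adds coeffs[j] whenever bit j of addr is 1.
    """
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]

def da_inner_product(coeffs, xs, bits):
    """Inner product of coeffs and xs using only LUT reads, shifts and adds.

    Each x in xs must be a signed integer that fits in `bits`-bit
    two's complement.
    """
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]  # two's complement encoding of each input
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one LUT address.
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> b) & 1) << j
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc

coeffs = [3, -5, 7, 2]
xs = [12, -7, 0, -128]
da_result = da_inner_product(coeffs, xs, bits=8)
direct_result = sum(c * x for c, x in zip(coeffs, xs))
```

The loop performs one LUT read and one shifted add per input bit, so no general multiplier is needed. The cost is a table whose size grows as 2 to the power of the number of coefficients.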

Fig.1. Distributed Arithmetic

The DCT is a computationally intensive operation. A direct implementation requires a large number of adders and multipliers. Multipliers consume more power, so distributed arithmetic (DA) is used to implement multiplication without a multiplier. The eight output equations F(0) to F(7) of the 8-point 1D-DCT are analysed, and DA is used in place of multipliers in the 1D-DCT architecture.
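
The equations F(0) to F(7) are not reproduced in the text. Assuming the commonly used orthonormal 8-point DCT-II, with f(x) denoting the input samples (a symbol introduced here for illustration), they would take the following form:

```latex
F(u) = \frac{C(u)}{2} \sum_{x=0}^{7} f(x)\,
       \cos\!\left( \frac{(2x+1)\,u\,\pi}{16} \right),
\qquad u = 0, 1, \ldots, 7,
\qquad
C(u) =
\begin{cases}
  1/\sqrt{2}, & u = 0 \\
  1,          & u \neq 0
\end{cases}
```

Each F(u) is an inner product of the eight inputs with a fixed set of cosine constants, which is the sum-of-products form that DA targets.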

III. DISCRETE COSINE TRANSFORM

The Discrete Cosine Transform is a conventional signal processing technique used in a number of applications. Its energy compaction property means that a few DCT coefficients contain most of the relevant information about an image, which makes it useful in image compression applications. The Arithmetic Cosine Transform (ACT) algorithms are used for fast computation of the DCT. The ACT consists only of additions and multiplications by constant values. This avoids the rounding errors that arise when floating-point operations come into the picture. Exact evaluation of the ACT is possible if the input data are non-uniformly sampled and have zero mean. This paper addresses two main issues: (i) calculation of the mean value of the input signal from non-uniformly sampled data and (ii) efficient ACT architectures for calculating the 8-point DCT when only non-uniform samples of the input are available.

Fig.2. Overall Architecture for DA Based DCT

A. ACT Architectures

This paper introduces architectures for the ACT that accept only non-uniform samples as inputs and calculate the DCT with reduced area and low power consumption. The methods explained above are used in the design of these architectures. Registers are introduced at different nodes for temporary storage, which gives the design a fully pipelined structure. This pipelining reduces the critical path delay at the cost of a slight increase in latency. Architecture I is the ACT architecture for computing the DCT of null mean input signals. It is realized with only additions and constant multiplications by integers, which reduces truncation error and complexity. Architecture I, shown in Fig.3, corresponds to N=8 and takes 10 non-uniform samples as inputs, according to the values of the set S. Applications dealing with zero mean input signals use this architecture, with the advantages of lower computational complexity and area. The second architecture calculates the DCT of input signals with non-null mean. For these it is necessary to calculate the mean value of the incoming non-uniform samples, and the Mertens correction function is included. Architecture II consists of Architecture I, a mean calculation block and a Mertens correction block, as shown in Fig.4.

Fig.3. Architecture I for Null Mean Input Signals

Fig.4. Architecture II for Non-Null Mean Input Signals

B. Computing Arithmetic Mean

The mean value must be calculated if the incoming signals have non-null mean. The architecture is shown in Fig.5. The input signals, scaled by the sampling instants, are given as input to the mean value calculation block. Each input is then multiplied by the corresponding interpolation weight. Finally, all these products are added, and the sum is the mean value of the incoming non-null mean sequence.
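
The multiply-by-weight-and-add structure can be sketched as follows. The interpolation weights used in the paper are not given in the text, so this example substitutes trapezoidal-rule weights as a stand-in assumption. The sampling instants and the test signal are also made up for illustration.

```python
def trapezoid_weights(times, period):
    """Stand-in interpolation weights (an assumption, not the paper's values).

    Trapezoidal rule on sorted sample times that span [0, period]; the
    weights sum to 1, so the weighted sum estimates the signal mean.
    """
    n = len(times)
    weights = []
    for i in range(n):
        left = times[i] - times[i - 1] if i > 0 else 0.0
        right = times[i + 1] - times[i] if i < n - 1 else 0.0
        weights.append((left + right) / (2.0 * period))
    return weights

def weighted_mean(samples, weights):
    """Multiply each sample by its weight, then add all the products."""
    return sum(w * s for w, s in zip(weights, samples))

# Hypothetical non-uniform sampling instants on an interval of length 8.
times = [0.0, 1.0, 1.6, 2.0, 4.0, 8.0]
weights = trapezoid_weights(times, period=8.0)

# Test signal f(t) = 3 + 0.5 t, whose true mean over [0, 8] is 5.
samples = [3.0 + 0.5 * t for t in times]
mean_estimate = weighted_mean(samples, weights)
```

Once the weights are fixed constants, the block reduces to constant multiplications followed by an adder tree, which is the structure the text describes.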

Fig.5. Mean Value Calculation.

C. Mertens Correction Factor

For a non-null mean input sequence, a correction term must be subtracted in order to obtain the DCT coefficients. This is called the Mertens correction term, M(n). It is the sum of the Möbius function μ(k) over k = 1, ..., n.
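
The Möbius and Mertens functions are standard number-theoretic functions and can be computed as in the sketch below. How the correction term is scaled and applied inside the ACT is not shown here, since the text does not specify it.

```python
def mobius(n):
    """Möbius function mu(n).

    Returns 0 if n has a squared prime factor, otherwise
    (-1) raised to the number of distinct prime factors.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # p squared divides the original n
            result = -result
        p += 1
    if n > 1:
        result = -result  # one remaining prime factor
    return result

def mertens(n):
    """Mertens function M(n): sum of mu(k) for k = 1, ..., n."""
    return sum(mobius(k) for k in range(1, n + 1))

mertens_values = [mertens(n) for n in range(1, 11)]
```

For the small n relevant to an 8-point transform, these values are small integers that can be stored as constants in hardware.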

Fig.6. Mertens Correction Block

IV. IMPLEMENTATION AND RESULTS

A. FPGA Implementation

We implemented both architectures described in the previous section. They were tested on a Xilinx Virtex-6 XC6VLX240T FPGA using the stepped hardware co-simulation feature of the ML605 evaluation platform. Both were fully pipelined to achieve maximum throughput. The word-length at the inputs is L, and the inputs are assumed to lie in the range [-1, 1].

Throughout the fixed-point implementation, the word-length increases to avoid overflow. Depending on the particular quantization point, the actual allocated word-length is L + ∆L, where the values of ∆L are listed in Table 2 for both proposed architectures. The quantization points referred to are shown in Figs. 7 and 8. The number of fractional bits is held constant throughout the design and is equal to L-1.
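
The fixed-point format can be modelled with a short sketch: L-bit signed words with L-1 fractional bits, plus a generic bound on the extra bits needed when several words are added. The actual ∆L values of the design are those of Table 2 and are not reproduced here. The function names and the saturation behaviour at +1 are assumptions of this sketch.

```python
def to_fixed(x, word_length):
    """Quantize x in [-1, 1] to a signed integer code.

    The code has `word_length` bits in total, of which
    `word_length - 1` are fractional bits.
    """
    frac_bits = word_length - 1
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << frac_bits)
    highest = (1 << frac_bits) - 1
    # Saturate: +1.0 itself is not representable in this format.
    return max(lowest, min(highest, scaled))

def to_float(code, word_length):
    """Convert a fixed-point code back to a real value."""
    return code / (1 << (word_length - 1))

def growth_bits_for_sum(num_terms):
    """Extra integer bits that rule out overflow when adding num_terms words.

    Equals ceil(log2(num_terms)) for num_terms >= 1.
    """
    return (num_terms - 1).bit_length()

code_8 = to_fixed(0.3, 8)
code_12 = to_fixed(0.3, 12)
error_8 = abs(to_float(code_8, 8) - 0.3)
error_12 = abs(to_float(code_12, 12) - 0.3)
```

A longer word-length gives a smaller quantization step, so `error_12` is smaller than `error_8` here. Adding eight words needs three extra integer bits in the worst case.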

The accuracy of the results from Architectures I and II was tested with varying values of L, using average percentage error and peak signal-to-noise ratio (PSNR) as figures of merit. Both figures of merit take as reference the DCT coefficients calculated by the floating-point implementation of the DCT available in Matlab. The results given in Table 3 are taken from simulations of Architectures I and II using 10^4 random input signals. Reducing the input word-length L degrades both figures of merit. However, even for small word-lengths, the errors incurred are tolerable for most applications. Table 4 shows the resource utilization, power consumption and operational frequency on the Xilinx Virtex-6 XC6VLX240T FPGA device for input fixed-point word-lengths (L) of 8 and 12. Information about the Xilinx FPGA resources listed in Table 4, including slices, slice FFs and four-input look-up tables (LUTs), can be found in the device datasheet.

TABLE I. Computational Complexity of Proposed Architecture I and Architecture II

TABLE II. Fixed Point Word-Length Increase ∆L at Each Quantization Point of the ACT Signal Flow Graph

TABLE III. Average Percentage Error and Average Peak Signal to Noise Ratio of ACT Implementations with Fixed Point Input Word-Length L, when Tested with 10,000 Input Vectors
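
The two accuracy figures of merit, average percentage error and PSNR, can be computed as in the sketch below. The exact definitions used for Table 3 are not stated in the text, so these are common textbook forms: PSNR takes the largest reference magnitude as the peak, and the percentage error skips zero-valued reference entries. The sample vectors are invented.

```python
import math

def average_percentage_error(reference, measured):
    """Mean of |ref - meas| / |ref| * 100 over entries with nonzero reference."""
    terms = [abs(r - m) / abs(r) * 100.0
             for r, m in zip(reference, measured) if r != 0]
    return sum(terms) / len(terms)

def psnr_db(reference, measured):
    """PSNR in dB, with the peak taken as the largest reference magnitude."""
    mse = sum((r - m) ** 2 for r, m in zip(reference, measured)) / len(reference)
    if mse == 0:
        return math.inf
    peak = max(abs(r) for r in reference)
    return 10.0 * math.log10(peak * peak / mse)

# Invented data: floating-point reference coefficients and a fixed-point result.
reference = [1.0, -2.0, 4.0, 0.5]
measured = [1.01, -1.98, 4.0, 0.5]
ape = average_percentage_error(reference, measured)
psnr = psnr_db(reference, measured)
```

In an evaluation of this kind, both quantities would be averaged over all test vectors, with the floating-point DCT output serving as `reference`.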

Architecture I is multiplier-free and has the lower complexity, but it is suitable only for null mean signals. To remove the dependence of power consumption on operating frequency, the normalized power metric (dynamic power normalized to operating frequency) is given in Table 4. The total power consumption in the FPGA is dominated by static power, since both architectures occupy only roughly 1 percent of the available area.

B. ASIC Synthesis Results

The proposed Architectures I and II are synthesized for application-specific integrated circuits (ASICs) using the Cadence RTL Compiler for 45 nm technology. The FreePDK45 standard-cell library is used in synthesis, with the optimization goal set to maximize speed. Synthesis was performed at an operating voltage of 1.1 V. The area, power, operational frequency, and normalized power metric (dynamic power normalized to operating frequency and the square of the supply voltage) for the ASIC synthesis are presented in Table 5. Table 6 compares the proposed ACT Architectures I and II with other published eight-point DCT implementations. Ideally, a fair comparison requires all implementations to use the same process, operating frequency, and supply voltage. However, the published literature covers varying technologies and operating conditions. Hence Table 6 gives a normalized power consumption value, in which the power consumption is normalized to the corresponding operational frequency and the square of the supply voltage. From the normalized power consumption in Table 6, it is apparent that the proposed architectures consume less power than the compared architectures. We emphasize that the proposed Architecture I has the distinct advantage of exact computation. Approximate DCT methods were therefore not taken into consideration for comparison purposes.
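
The two normalized power metrics can be written out as a small sketch. The units (mW, MHz, V) and the sample numbers are assumptions for illustration only and are not results from Tables 4 to 6.

```python
def normalized_power_fpga(dynamic_power_mw, freq_mhz):
    """Dynamic power normalized to operating frequency (mW per MHz)."""
    return dynamic_power_mw / freq_mhz

def normalized_power_asic(dynamic_power_mw, freq_mhz, supply_v):
    """Dynamic power normalized to frequency and supply voltage squared."""
    return dynamic_power_mw / (freq_mhz * supply_v ** 2)

# Two invented designs that differ in clock rate and supply voltage.
design_a = normalized_power_asic(dynamic_power_mw=121.0, freq_mhz=500.0, supply_v=1.1)
design_b = normalized_power_asic(dynamic_power_mw=90.0, freq_mhz=200.0, supply_v=1.5)
```

The normalization follows the usual first-order model in which dynamic power scales with frequency and with the square of the supply voltage. Here `design_a` draws more raw power than `design_b`, yet its normalized value is equal to within rounding.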

TABLE IV. Speed of Operation, Resource Utilization and Power Consumption of the XC6VLX240T FPGA Device Used for Input Fixed Point Word-Lengths L and for Architectures I and II

TABLE V. Speed of Operation, Power Consumption and Area Utilization in ASIC Synthesis Results for Fixed Point Word-Lengths L for Architectures I and II (45 nm technology)

TABLE VI. Comparison of the Proposed Implementation with Published DCT Implementations

Fig.7. (a) Null Mean ACT and (b) Mean Calculation Block.

Fig.8. Architecture II: Non-Null Mean DCT Calculation Using the Mertens Correction Block.

V. CONCLUSION

Various algorithms for evaluating the Discrete Cosine Transform are analyzed by considering the number of additions and multiplications they need, their computational difficulty, area complexity, probability of error, and power consumption. The Arithmetic Cosine Transform is found to be a fast method for computing the DCT.

It has reduced architectural complexity, using only adders and constant integer multipliers, which frees the structure from the truncation errors associated with floating-point operations. Two architectures are designed, for null mean and non-null mean input signals, both taking only non-uniformly sampled inputs.

Author’s Profile:

D. Divya Sri, Department of VLSI, from Avanthi’s Scientific Technological & Research Academy, India.

E-mail: [email protected].

Mr. N. Ashok Kumar received the Master of Technology degree in Radar and Micro Engineering from Andhra University Vijayawada and the Bachelor of Engineering degree from V.R. Siddhartha Engineering College-JNTUH. He is currently working as Associate Professor and Head of the Department of ECE at Avanthi’s Scientific Technological & Research Academy, Hyderabad. His subjects of interest are Embedded Systems, Microprocessors, Communication Systems and Digital Electronics.

E-mail: [email protected].

Mr. G. Chandrashekar Reddy received the Master of Technology degree in VLSI Systems Design from Avanthi’s Scientific Technological & Research Academy-JNTUH and the Bachelor of Engineering degree in ECE from Vaageswari College of Engineering-JNTUH. He is currently working as Assistant Professor of ECE at Avanthi’s Scientific Technological & Research Academy, Hyderabad. His subjects of interest are Embedded Systems, Microprocessors, Communication Systems and Digital Electronics.