applied with an unrolling factor of two, along with the HLS ARRAY_PARTITION directive with cyclic partitioning by a factor of two on the input arrays. Synthesis with these directives did not produce the desired result, as the HLS tool was unable to determine that the indexing between the odd and even partitions within the loop structure was static. The tool therefore scheduled two read and two write operations for each partitioned array, plus a multiplexer to dynamically select the partition needed in the current iteration. This effectively nullified the unrolling of the loop, as no parallelization was achieved due to the incorrect scheduling of the read and write operations. To remedy this, the input arrays were manually partitioned into two arrays of half the size, containing the even and odd data. To improve the layout of the code, the butterfly operation was placed in a function. The butterfly function was designed with a C++ function template so that each iteration of the FFT stage loop statically selects which of the “ping-pong” memories to read from and write to. This was necessary because the HLS tools cannot resolve dependencies between function calls within a loop that operate on the same array, and will always schedule them sequentially. Use of the templated function ensures that only a single version of the function is called in each loop iteration, with static memory addressing allowing parallel execution. The FFT group loop was then manually unrolled by a factor of two, as shown in Algorithm 13, with the unrolled initial and final stages omitted for clarity. The design was then synthesized, producing the timing results shown in Table 4.8. These results show that unrolling the FFT group loop reduced the latency of both the FFT and IFFT functions by nearly half, as expected.
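A minimal sketch of the templated-butterfly idea follows; the names, buffer sizes, and signature are illustrative assumptions, not the thesis's actual code. The compile-time template parameter PING selects which ping-pong buffer is read and which is written, so each instantiation has fully static addressing:

```cpp
#include <cassert>
#include <complex>

typedef std::complex<float> cplx;

// PING statically selects the read and write buffers, so the HLS tool
// sees fixed memory addressing in each instantiation and infers no
// multiplexer between the partitions.
template <bool PING, int N>
void butterfly(cplx buf0[N], cplx buf1[N], int even, int odd, cplx w) {
    cplx *rd = PING ? buf0 : buf1;   // resolved at compile time
    cplx *wr = PING ? buf1 : buf0;
    cplx t = w * rd[odd];            // twiddle multiplication
    wr[even] = rd[even] + t;         // radix-2 butterfly outputs
    wr[odd]  = rd[even] - t;
}
```

Calling butterfly<true, N> in one stage and butterfly<false, N> in the next yields two distinct functions, each with static addressing, so calls within an unrolled loop body can be scheduled in parallel.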
Because the input arrays of the design were partitioned, the weight coeff, mult coeff, and unweight coeff loops were also unrolled by applying the HLS UNROLL directive with an unrolling factor of two, producing the timing results shown in Table 4.9.
In VLSI, design space exploration under various constraints is complex using the conventional RTL design flow. High-level synthesis techniques are useful in abstracting the design to a higher level than the regular RTL design flow. The possible hardware architectures need to be explored to bring out the design trade-offs in terms of parameters such as latency, critical path delay, and resource utilization. The focus of the work presented here is to explore systolic array mapping methods with and without HLS transformations. Unfolding and pipelining are the HLS transformations applied to the DSP benchmark, the FIR filter. Unfolding enhances the opportunities for concurrency in loops, and a pipelined architecture makes that concurrency possible. The Vivado HLS tool is used to explore the design space for random subspace mapping and computational subspace mapping, and to analyze their merits and demerits in terms of the design trade-off performance parameters when the design is mapped to the Zynq architecture.
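As a rough illustration of unfolding, the sketch below computes two FIR output samples per loop iteration, exposing two independent multiply-accumulate chains that a pipeline can execute concurrently. The tap count, coefficient values, and function name are hypothetical, not taken from the benchmark itself:

```cpp
#include <cassert>

#define TAPS 4  // illustrative filter length

// FIR filter unfolded by a factor of J = 2: each outer iteration produces
// y[i] and y[i+1]. The two accumulations are independent, so an HLS
// pipeline can overlap them. Samples before x[0] are treated as zero.
void fir_unfold2(const float x[], float y[], int n, const float h[TAPS]) {
    for (int i = 0; i + 1 < n; i += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (int k = 0; k < TAPS; ++k) {
            acc0 += (i - k >= 0)     ? h[k] * x[i - k]     : 0.0f;
            acc1 += (i + 1 - k >= 0) ? h[k] * x[i + 1 - k] : 0.0f;
        }
        y[i]     = acc0;
        y[i + 1] = acc1;
    }
}
```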
In the past decade, there has been a substantial increase in the level of hardware abstraction that High-Level Synthesis (HLS) [1-5] tools offer, which has made designing a complete System-on-Chip (SoC) much more practical. By designing at the system level, hardware engineers can avoid gate-level semantics. HLS tools work by taking applications written in a subset of ANSI C and translating them into a Register Transfer Level (RTL) module for Application-Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) chip design. The design workflow requires knowledge of both software, to write C applications, and hardware, to parallelize tasks and resolve timing and memory management issues. There has been significant previous work discussing how to teach RTL concepts to students and design simple applications for SoCs [6,7]. Nevertheless, the learning curve for software engineers is relatively high, since they need to use Hardware Description Languages (HDLs) such as Verilog and VHDL. By using HLS tools, software engineers can combine their programming skills with hardware knowledge to create complex embedded hardware/software co-design systems.
In this paper, we have presented a high-level synthesis flow that takes into account operators whose latency varies with data width. Both ASIC and FPGA platforms can be targeted. Accurate computation-path delay models are used for the allocation and scheduling steps. The synthesis process makes it possible to increase the utilization rate of the resources, avoiding wasted clock cycles. Design latency can be reduced for resource-constrained syntheses. In our experiments, the design latency saving is about 19% in comparison to a conventional approach in which the propagation delay of an operator is fixed regardless of the width of the data it handles. Energy consumption is also reduced. For time-constrained syntheses, area can be reduced; the area saving is about 9% in comparison to a conventional approach.
This dissertation describes research activities broadly concerning the area of High-Level Synthesis (HLS), and more specifically the HLS-based design of energy-efficient hardware (HW) accelerators. HW accelerators, mostly implemented on FPGAs, are integral to the heterogeneous architectures employed in modern high-performance computing (HPC) systems due to their ability to speed up execution while dramatically reducing the energy consumption of computationally challenging portions of complex applications. Hence, the first activity concerned an HLS-based approach to directly execute OpenCL code on an FPGA instead of its traditional GPU-based counterpart. Modern FPGAs offer considerable computational capability while consuming significantly less power than high-end GPUs. Several different implementations of the K-Nearest Neighbor algorithm were considered on both FPGA- and GPU-based platforms and their performance was compared. The FPGAs were more energy-efficient than the GPUs in all the test cases. Eventually, we were also able to obtain a faster (in terms of execution time) FPGA implementation by using an FPGA-specific OpenCL coding style and suitable HLS directives.
In Chapters 4, 5, and 6 we introduced the H-QED technique and its hybrid tracing and hybrid hashing variations, which utilize HLS principles for quickly detecting bugs inside hardware accelerators in SoCs in both pre-silicon debugging and post-silicon validation scenarios. Our results demonstrate the effectiveness and practicality of H-QED: up to two orders of magnitude improvement in error detection latency, up to a threefold improvement in coverage, less than 10% accelerator-level overhead, and negligible performance overhead. In our pre-silicon hybrid tracing variation, we demonstrate that the technique can pinpoint the source-code location of logic bug activation and provide a strong hint for potential bug fixes to the hardware designer. Furthermore, these techniques also discovered previously unknown bugs in the widely used CHStone HLS benchmark suite. Through hybrid hardware/software traces and signatures, our techniques minimize intrusiveness during validation. Thus, the combination of QED and hybrid tracing/hashing provides a systematic approach to validation of complex SoCs consisting of processor cores, uncore components, programmable accelerators, and hardware accelerators. Future directions related to H-QED include:
Field Programmable Gate Arrays (FPGAs) consist of arrays of gates that can be programmed and reconfigured by a designer. Hardware description languages (VHDL/Verilog) are widely used to model a design in a bit- and cycle-accurate way and map it to an FPGA via the available synthesis tools, generating a functionally verified bit-stream in an automated flow [1, 2]. Although multiple companies share the FPGA market, Intel (after its acquisition of Altera in 2016) and Xilinx are the two major competitors. Even though Intel cannot be underestimated, especially when it comes to its technology and capital, this work focuses on Xilinx tools and technology. Xilinx is the market leader in FPGAs, with an 18-month technology lead. Its products are aimed at meeting the requirements of various workloads from different domains [3, 4]; Figure 1.2 suggests that more than 50% of the FPGA market belongs to Xilinx programmable platforms.
The FFT processor is a critical block in orthogonal frequency division multiplexing (OFDM) technology. Because the processing must keep pace with the clock frequency of the sampled data, pipelined FFTs are generally preferred, especially for low-power or high-throughput solutions. The commutator and complex-multiplier blocks at each stage contribute a dominant part of the entire power consumption in the pipelined architecture. This paper proposes an optimal design to minimize one of the significant power-consuming factors, the switching activity. A coefficient ordering method is followed to reduce the switching activity between successive coefficients used by the complex multipliers. The coefficient ordering requires a consistent data sequence matching the new ordering of coefficients. Thus, lower hardware complexity and higher efficiency can be attained.
In VLSI design, power, speed, and area are the measures most often used to evaluate performance. Arithmetic is the oldest and most elementary branch of mathematics; the name comes from the Greek word arithmos, meaning number. Arithmetic is used by almost everyone, for tasks ranging from simple day-to-day counting to advanced science and business calculations. As a result, the need for a faster and more efficient arithmetic unit in computers has been a topic of interest for decades. The work presented makes use of Vedic mathematics and proceeds step by step, first designing a Vedic multiplier, then a multiply-accumulate (MAC) unit, and finally an arithmetic module that uses this multiplier and MAC unit. The four basic operations in elementary arithmetic are addition, subtraction, multiplication, and division. Multiplication is essentially the mathematical operation of scaling one number by another. In today's engineering world, multiplication-based operations are among the most frequently used functions, currently implemented in many digital signal processing (DSP) applications such as convolution, the Fast Fourier Transform, filtering, and in the arithmetic logic unit (ALU) of microprocessors. Since multiplication is such a frequently used operation, a multiplier must be fast and power-efficient, and so the development of fast, low-power multipliers has been a subject of interest for decades.
Consider a word length of 16 bits. A single simple multiplier implementation needs 16 rows of partial product generation, each row containing 16 partial product bits. Accumulating these 16 partial product rows requires large hardware to obtain the result in sum-and-carry form. Since the implementation of a 16-point radix-2 FFT requires a large number of multiplications, performing all of these multiplications using Vedic mathematics reduces the time, area, and power.
In most DSP algorithms, performance depends on the path delay of the multiplier. The speed of multiplication is very important in DSP as well as in general-purpose processors. In the early period, multiplications were generally implemented as a sequence of shift and add operations. Many algorithms have been proposed in the literature to perform multiplication, each providing various advantages and having trade-offs in terms of speed, circuit complexity, and area. Multiplication also dominates the execution time of most DSP applications, and hence there is a need for a high-speed multiplier when designing an efficient ALU (Himanshu 2004). For this, an ancient system of calculation rediscovered from the Vedas by Sri Bharati Krushna Tirthaji Maharaj, known as “Vedic Mathematics”, is used. The peculiarity of Vedic Mathematics lies in its simplicity and the flexibility it provides in carrying out calculations mentally (Jayaprakash et al. 2014). This gives us the liberty to choose the technique most suitable for us. According to Tirthaji, all of Vedic mathematics is based on sixteen Sutras, which are actually
FFT and IFFT are commonly used algorithms for processing signals. They are used in WLAN, image processing, spectrum measurement, radar, and multimedia communication services. Nowadays, FFT processors are used in wireless communication systems that require fast execution and low power consumption, the most important constraints on an FFT processor. Complex multiplication is the main arithmetic operation used in FFT/IFFT blocks, and it is the main issue in the processor: it is time consuming and consumes a large chip area and power. When a large-point FFT is to be designed, the complexity increases. To reduce the complexity of the multiplication there are two methods; one simple method is to let the real and constant multiplications take the
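One standard way to cut the multiplier count in a complex multiplication (shown here purely for illustration; the excerpt's own two methods are not fully specified) is the three-multiplication form, trading one real multiplier for extra adders:

```cpp
#include <cassert>

// (ar + ai*i) * (br + bi*i) with 3 real multiplications instead of 4:
//   k1 = br*(ar+ai), k2 = ar*(bi-br), k3 = ai*(br+bi)
//   real = k1 - k3,  imag = k1 + k2
void cmul3(float ar, float ai, float br, float bi, float *pr, float *pi) {
    float k1 = br * (ar + ai);
    float k2 = ar * (bi - br);
    float k3 = ai * (br + bi);
    *pr = k1 - k3;   // = ar*br - ai*bi
    *pi = k1 + k2;   // = ar*bi + ai*br
}
```

When one operand is a constant twiddle factor, the sums bi-br and br+bi can be precomputed, leaving three multiplications and two additions per complex product.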
Himanshu Thapaliyal and M. B. Srinivas, “VLSI Implementation of RSA Encryption System Using Ancient Indian Vedic Mathematics”, Center for VLSI and Embedded System Technologies, International Institute of Information Technology, Hyderabad, India.
Charles H. Roth Jr., “Digital Systems Design Using VHDL”, Thomson Brooks/Cole, 7th reprint, 2005.
Multipliers are important units in digital systems and other applications related to digital processing. Several researchers have tried to design multipliers that meet one or more of the main constraints, i.e. low power consumption, low area utilization, and high speed, or a combination of them. The multiplication algorithm uses an add-and-shift methodology, and the partial product values are accumulated in parallel to enhance the performance of the multipliers. To meet the speed constraint, the Baugh-Wooley multiplier algorithm is used.
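A bit-level sketch of the Baugh-Wooley idea follows (word length and names are illustrative assumptions): the negatively weighted partial products involving the sign bits are replaced by their complements, and a fixed correction constant is added, so the array multiplies two's-complement operands using only AND/NAND partial products. The sketch evaluates the array arithmetically rather than as gates:

```cpp
#include <cassert>
#include <cstdint>

const int N = 8;  // illustrative word length

// Baugh-Wooley N-bit signed multiply, result modulo 2^(2N).
// Rows involving a sign bit are complemented (NAND instead of AND),
// and the correction constant 2^N + 2^(2N-1) is added.
uint32_t baugh_wooley(int8_t x, int8_t y) {
    uint32_t a = (uint8_t)x, b = (uint8_t)y;
    uint32_t p = 0;
    for (int i = 0; i < N - 1; ++i)           // unsigned core of the array
        for (int j = 0; j < N - 1; ++j)
            p += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j);
    p += (((a >> (N-1)) & 1) & ((b >> (N-1)) & 1)) << (2*N - 2);
    for (int j = 0; j < N - 1; ++j)           // complemented row (a's sign bit)
        p += ((1u ^ (((a >> (N-1)) & 1) & ((b >> j) & 1))) << j) << (N - 1);
    for (int i = 0; i < N - 1; ++i)           // complemented row (b's sign bit)
        p += ((1u ^ (((a >> i) & 1) & ((b >> (N-1)) & 1))) << i) << (N - 1);
    p += (1u << N) + (1u << (2*N - 1));       // correction constant
    return p & 0xFFFFu;                        // keep 2N bits
}
```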
This work demonstrates that the optimal quantized coefficients (in terms of mean square error) can be obtained as the solution of an integer quadratic programming problem. New fixed-width multiplier topologies, with different area-accuracy trade-offs, are then obtained by changing the quantization scheme. Two topologies are selected as the best ones. The first (named the “2-bit fixed-width multiplier”) is based on a uniform coefficient quantization with two bits. The second topology (the “1.5-bit fixed-width multiplier”) is based on a non-uniform quantization in which a portion of the coefficients (around one half of the total) are quantized with two bits, while the remaining coefficients are quantized with a single bit. The proposed fixed-width multiplier topologies exhibit better accuracy than previous solutions, close to the theoretical lower bound.
ABSTRACT: A multiplier is one of the key hardware blocks in most digital signal processing (DSP) systems. Typical DSP applications where a multiplier plays an important role include digital filtering, digital communications, and spectral analysis. Many current DSP applications are targeted at portable, battery-operated systems, so power dissipation becomes one of the primary design constraints. Since multipliers are rather complex circuits and must typically operate at a high system clock rate, reducing the delay of a multiplier is an essential part of satisfying the overall design. This paper puts forward a high-speed multiplier, efficient in terms of speed, making use of Urdhva Tiryagbhyam, a sutra from Vedic mathematics, for multiplication, and half adders for the addition of partial products. The code is written in VHDL, and the results show that the multiplier implemented using Vedic multiplication is efficient in terms of area and speed compared to implementations using Array and Booth multiplier architectures.
fast and expensive in terms of area. The digit-serial architecture is flexible; it has moderate speed and a reasonable cost of implementation. Two low-energy digit-serial PB multipliers have been proposed in which a binary tree of XOR gates is used instead of a linear array of XOR gates for degree reduction, reducing both power consumption and delay. Various digit-serial multipliers have been proposed, such as most-significant-digit-first and least-significant-digit-first designs with modifications to the architecture. A factoring technique is involved in the design of a digit-serial PB multiplier over GF.
Power dissipation in CMOS circuits is caused by three main sources: 1) the charging and discharging of capacitive loads due to changes in input logic levels; 2) the short-circuit current that arises from the direct current path between the supply rails during output transitions; and 3) the leakage current, which is determined by the fabrication technology and consists of the reverse-bias current in the parasitic diodes formed between the source and drain diffusions and the bulk region of a transistor, as well as the sub-threshold current that arises from the inversion charge existing at gate voltages below the threshold voltage. The short-circuit and leakage currents in CMOS circuits can be made small with proper device and circuit design techniques. The dominant source of power consumption is the charging and discharging of the node capacitances, and it can be minimized by reducing the switching activity of the transistors. The switching activity of a digital circuit is also a function of the logic style used to implement the circuit.
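The dominant charging-discharging term just described is conventionally captured by the standard dynamic power model (a textbook relation, not one stated in this excerpt):

```latex
P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f_{clk}
```

where \(\alpha\) is the switching activity factor, \(C_L\) the switched node capacitance, \(V_{DD}\) the supply voltage, and \(f_{clk}\) the clock frequency; reducing \(\alpha\) attacks this term directly, which is the motivation behind techniques such as coefficient reordering.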
In the proposed architecture, partial product reduction is accomplished by the use of 4:2 and 5:2 compressor structures, and the final stage of addition is performed by a Sklansky adder. This multiplier architecture comprises a partial product generation stage, a partial product reduction stage, and a final addition stage. The latency of the Wallace tree multiplier can be reduced by decreasing the number of adders in the partial product reduction stage. In the proposed architecture, multi-bit compressors are used to reduce the number of partial product addition stages. The combined factors of low power, low
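A 4:2 compressor can be realized as two cascaded full adders (one common realization; the structure here is a generic sketch, not necessarily the proposed design). It takes four bits plus a carry-in of the same weight and emits a sum bit plus two bits of double weight, preserving the arithmetic value:

```cpp
#include <cassert>

struct FA { int s, c; };

// One-bit full adder: s is the sum bit, c the carry bit.
FA full_add(int a, int b, int c) {
    return { a ^ b ^ c, (a & b) | (b & c) | (a & c) };
}

// 4:2 compressor from two cascaded full adders. Invariant:
//   x1 + x2 + x3 + x4 + cin == sum + 2*(carry + cout)
void comp42(int x1, int x2, int x3, int x4, int cin,
            int *sum, int *carry, int *cout) {
    FA f1 = full_add(x1, x2, x3);
    *cout = f1.c;                       // passed to the neighbouring column
    FA f2 = full_add(f1.s, x4, cin);
    *sum = f2.s;
    *carry = f2.c;
}
```

Because cout depends only on x1..x3 and not on cin, chains of 4:2 compressors avoid a ripple path across columns, which is what makes them attractive for partial product reduction.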
Most computer arithmetic applications are implemented using digital logic circuits and thus operate with high reliability and precision. However, many applications, such as multimedia and image processing, can tolerate errors and imprecision in computation and still produce meaningful and useful results. The paradigm of inexact computation relies on relaxing fully precise and completely deterministic building modules when designing energy-efficient systems. In digital designs, integer multiplication is one of the fundamental building blocks and affects microprocessor and DSP performance. Fast non-Booth multipliers mostly use well-known schemes such as Wallace, Dadda, or the Three-Dimensional Method (TDM). These are all based on a carry-save compression tree, which uses full adders and half adders to turn a multi-operand sum into a two-operand addition, which is then completed by a final carry-propagate adder. In this paper we consider an approximate multiplier that uses a speculative functional unit.
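The carry-save step underlying Wallace, Dadda, and TDM trees can be sketched in a few lines: a row of full adders compresses three operands to a sum word and a carry word without propagating carries, leaving one carry-propagate addition at the end. This is a generic illustration of the principle, not the paper's speculative unit:

```cpp
#include <cassert>
#include <cstdint>

// One 3:2 carry-save stage: a + b + c == s + cy (no carry propagation
// inside the stage; each bit position is an independent full adder).
void csa(uint32_t a, uint32_t b, uint32_t c, uint32_t *s, uint32_t *cy) {
    *s  = a ^ b ^ c;                            // per-bit sum
    *cy = ((a & b) | (b & c) | (a & c)) << 1;   // per-bit carry, weight 2
}
```

Repeating this stage reduces any number of partial product rows to two, after which a single fast carry-propagate adder (or, in an approximate design, a speculative one) produces the final result.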