applied with an unrolling factor of two, along with the HLS ARRAY_PARTITION directive with cyclic partitioning by a factor of two on the input arrays. Synthesis of the design with these directives did not produce the desired result, as the HLS tool was unable to determine that the indexing between the odd and even partitions within the loop structure was static. The tool therefore scheduled two read and two write operations for each partitioned array, plus a multiplexer to dynamically select which partition was needed in the current iteration. This effectively nullified the unrolling of the loop, as no parallelization was achieved due to the incorrect scheduling of the read and write operations. To remedy this, the input arrays were manually partitioned into two half-size arrays containing the even and odd data. To improve the layout of the code, the butterfly operation was moved into a function. The butterfly function was written as a C++ function template to ensure that each iteration of the FFT stage loop statically selects which of the "ping-pong" memories to read from and write to. This was necessary because the HLS tools cannot resolve dependencies between function calls within a loop that operate on the same array and will always schedule them sequentially. Use of the templated function ensures that only a single instantiation is called in each loop iteration, with static memory addressing that allows parallel execution. The FFT group loop was then manually unrolled by a factor of two, as shown in Algorithm 13, with the unrolled initial and final stages omitted for clarity. The design was then synthesized, producing the timing results shown in Table 4.8. These results show that unrolling the FFT group loop reduced the latency of both the FFT and IFFT functions by nearly half, as expected.
Because the input arrays of the design were partitioned, the weight coeff, mult coeff, and unweight coeff loops were also unrolled by applying the HLS UNROLL directive with an unrolling factor of two, producing the timing results shown in Table 4.9.
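The static ping-pong selection described above can be sketched with a C++ function template. The names, the 8-point size, and the butterfly form below are illustrative assumptions, not the actual design:

```cpp
#include <array>
#include <complex>

// Sketch of the templated butterfly: the boolean template parameter picks
// the "ping-pong" source and destination buffers at compile time, so every
// call site has static memory addressing that an HLS tool can schedule in
// parallel. Sizes and names here are illustrative.
template <bool PING>
void butterfly(std::array<std::complex<float>, 8>& bufA,
               std::array<std::complex<float>, 8>& bufB,
               int i, std::complex<float> w) {
    std::array<std::complex<float>, 8>& src = PING ? bufA : bufB;
    std::array<std::complex<float>, 8>& dst = PING ? bufB : bufA;
    std::complex<float> t = w * src[i + 4];   // twiddle multiply
    dst[i]     = src[i] + t;                  // butterfly outputs
    dst[i + 4] = src[i] - t;
}
```

Because `butterfly<true>` and `butterfly<false>` are distinct functions, each loop iteration calls exactly one instantiation with fixed buffer roles, which is what avoids the sequential scheduling described above.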


In VLSI, design space exploration under various constraints is complex using a conventional RTL design flow. High-level synthesis techniques are useful for abstracting the design to a higher level than in the regular RTL design flow. The possible hardware architectures need to be explored to bring out the design trade-offs in terms of latency, critical path delay, and resource utilization. The focus of the work presented here is to explore systolic array mapping methods with and without HLS transformations. Unfolding and pipelining are the HLS transformations applied to the DSP benchmark, an FIR filter. Unfolding enhances the opportunities for concurrency in loops, and a pipelined architecture makes that concurrency possible. The Vivado HLS tool is used to explore the design space for random subspace mapping and computational subspace mapping, and to analyze their merits and demerits in terms of the design trade-off performance parameters when the design is mapped to the Zynq architecture.
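As a concrete illustration of unfolding, the sketch below unfolds the multiply-accumulate loop of a small FIR filter by a factor of two. The 4-tap size and names are assumptions for the example; in Vivado HLS the same structure would carry `#pragma HLS unroll factor=2` and `#pragma HLS pipeline`:

```cpp
#include <array>

// Illustrative 4-tap FIR with its multiply-accumulate loop unfolded by 2.
constexpr int TAPS = 4;

int fir(const std::array<int, TAPS>& coeff, std::array<int, TAPS>& delay, int x) {
    // Shift the delay line (newest sample at index 0).
    for (int i = TAPS - 1; i > 0; --i) delay[i] = delay[i - 1];
    delay[0] = x;

    // Unfolded by 2: the two partial sums have no mutual dependency,
    // so their multiply-accumulates can run concurrently in hardware.
    int acc0 = 0, acc1 = 0;
    for (int i = 0; i < TAPS; i += 2) {
        acc0 += coeff[i] * delay[i];
        acc1 += coeff[i + 1] * delay[i + 1];
    }
    return acc0 + acc1;
}
```

Feeding an impulse through the filter reproduces the coefficient sequence, which is a quick functional check of the unfolded loop.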

In the past decade, there has been a substantial increase in the level of hardware abstraction that High-Level Synthesis (HLS) [1-5] tools offer, which has made designing a complete System-on-Chip (SoC) much more practical. By designing at the system level, hardware engineers can avoid gate-level semantics. HLS tools take applications written in a subset of ANSI C and translate them into a Register Transfer Level (RTL) module for Application-Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) chip design. The design workflow requires knowledge of both software, to write the C applications, and hardware, to parallelize tasks and resolve timing and memory-management issues. There has been significant previous work on how to teach RTL concepts to students and design simple applications for SoCs [6,7]. Nevertheless, the learning curve for software engineers is relatively high, since they need to use Hardware Description Languages (HDLs) such as Verilog and VHDL. By using HLS tools, software engineers can combine their programming skills with hardware knowledge to create complex embedded hardware/software co-design systems.

In this paper, we have presented a high-level synthesis flow that takes into account operators whose latency varies with data width. Both ASIC and FPGA platforms can be targeted. Accurate computation-path delay models are used in the allocation and scheduling steps. The synthesis process makes it possible to increase the utilization rate of the resources, avoiding wasted clock cycles. Design latency can be reduced for resource-constrained syntheses: in our experiments, the latency saving is about 19% compared to a conventional approach in which the propagation delay of an operator is fixed regardless of the width of the data it handles. Energy consumption is also reduced. For time-constrained syntheses, area can be reduced, with a saving of about 9% compared to the conventional approach.


This dissertation describes research activities broadly concerning the area of High-Level Synthesis (HLS), and more specifically the HLS-based design of energy-efficient hardware (HW) accelerators. HW accelerators, mostly implemented on FPGAs, are integral to the heterogeneous architectures employed in modern high-performance computing (HPC) systems due to their ability to speed up execution while dramatically reducing the energy consumption of computationally challenging portions of complex applications. Hence, the first activity concerned an HLS-based approach for executing OpenCL code directly on an FPGA instead of its traditional GPU-based counterpart. Modern FPGAs offer considerable computational capability while consuming significantly less power than high-end GPUs. Several different implementations of the K-Nearest Neighbor algorithm were considered on both FPGA- and GPU-based platforms, and their performance was compared. The FPGAs were more energy-efficient than the GPUs in all the test cases. Eventually, we were also able to obtain a faster (in terms of execution time) FPGA implementation by using an FPGA-specific OpenCL coding style and suitable HLS directives.
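The kernel at the heart of that comparison is K-Nearest Neighbor classification. A minimal C++ sketch of such a kernel (2-D points, two classes; names and shapes are assumptions for illustration, not the dissertation's OpenCL code) is:

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Minimal KNN classifier sketch: find the k training points closest to the
// query (squared Euclidean distance, no sqrt needed for ranking) and take a
// majority vote over their labels (two classes, 0 and 1).
struct Neighbor { float dist; int label; };

int knnClassify(const std::vector<std::array<float, 2>>& points,
                const std::vector<int>& labels,
                const std::array<float, 2>& query, int k) {
    std::vector<Neighbor> nb;
    for (size_t i = 0; i < points.size(); ++i) {
        float dx = points[i][0] - query[0];
        float dy = points[i][1] - query[1];
        nb.push_back({dx * dx + dy * dy, labels[i]});
    }
    // Only the k smallest distances are needed, not a full sort.
    std::partial_sort(nb.begin(), nb.begin() + k, nb.end(),
                      [](const Neighbor& a, const Neighbor& b) {
                          return a.dist < b.dist;
                      });
    int votes = 0;                          // count of class-1 neighbors
    for (int i = 0; i < k; ++i) votes += nb[i].label;
    return votes * 2 > k ? 1 : 0;
}
```

On an FPGA, the distance loop is the part that benefits from pipelining and unrolling directives, while the selection step is often replaced by a small streaming sorter.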


In Chapters 4, 5, and 6 we introduced the H-QED technique and its hybrid tracing and hybrid hashing variations, which utilize HLS principles for quickly detecting bugs inside hardware accelerators in SoCs in both pre-silicon debugging and post-silicon validation scenarios. Our results demonstrate the effectiveness and practicality of H-QED: up to two orders of magnitude improvement in error-detection latency, up to a threefold improvement in coverage, less than 10% accelerator-level overhead, and negligible performance overhead. With the pre-silicon hybrid tracing variation, we demonstrate that the technique can pinpoint the source-code location of logic-bug activation and provide a strong hint for potential bug fixes to the hardware designer. Furthermore, these techniques also discovered previously unknown bugs in the widely used CHStone HLS benchmark suite. Through hybrid hardware/software traces and signatures, our techniques minimize intrusiveness during validation. Thus, the combination of QED and hybrid tracing/hashing provides a systematic approach to the validation of complex SoCs consisting of processor cores, uncore components, programmable accelerators, and hardware accelerators. Future directions related to H-QED include:


Field Programmable Gate Arrays (FPGAs) consist of arrays of gates that can be programmed and reconfigured by a designer. Hardware description languages (VHDL/Verilog) are widely used to model a design in a bit- and cycle-accurate way and to map it to an FPGA via the various available synthesis tools, generating a functionally verified bitstream in an automated flow [1, 2]. Although multiple companies share the FPGA market, Intel (after its acquisition of Altera in 2016) and Xilinx are the two major competitors. Even though Intel cannot be underestimated, especially regarding its technology and capital, this work focuses on Xilinx tools and technology. Xilinx is the market leader in FPGAs, with an 18-month technology lead. Its products are aimed at the requirements of workloads from many different domains [3, 4]; figure 1.2 suggests that more than 50% of the FPGA market belongs to Xilinx programmable platforms.


The FFT processor is a critical block in orthogonal frequency division multiplexing (OFDM) technology. Because the processing must keep up with the clock frequency of the sampled data, preference is usually given to pipelined FFTs, especially for low-power or high-throughput solutions. The commutator and the complex multiplier blocks at each stage contribute a dominant part of the total power consumption of the pipelined architecture. This paper proposes an optimized design that minimizes one of the significant power-consuming factors, the switching activity. A coefficient-ordering method is followed to reduce the switching activity between successive coefficients used by the complex multipliers. Coefficient ordering requires that the data sequence be kept consistent with the new ordering of the coefficients. In this way, lower hardware complexity and maximum efficiency can be attained.
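A minimal sketch of the coefficient-ordering idea, assuming 16-bit coefficient words and a simple greedy heuristic (the paper's actual ordering method may differ):

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Switching activity between two successive coefficient words is the number
// of bit positions that toggle, i.e. the Hamming distance.
int hamming(uint16_t a, uint16_t b) {
    return static_cast<int>(std::bitset<16>(a ^ b).count());
}

// Greedy reordering: always emit next the remaining coefficient closest (in
// Hamming distance) to the last one emitted, so consecutive words on the
// multiplier's coefficient bus differ in few bits.
std::vector<uint16_t> orderCoefficients(std::vector<uint16_t> coeffs) {
    std::vector<uint16_t> ordered;
    ordered.push_back(coeffs.front());
    coeffs.erase(coeffs.begin());
    while (!coeffs.empty()) {
        size_t best = 0;
        for (size_t i = 1; i < coeffs.size(); ++i)
            if (hamming(ordered.back(), coeffs[i]) <
                hamming(ordered.back(), coeffs[best]))
                best = i;
        ordered.push_back(coeffs[best]);
        coeffs.erase(coeffs.begin() + best);
    }
    return ordered;
}
```

As noted above, the data path must then be resequenced to match the new coefficient order.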

In VLSI design, power, speed, and area are the measures most often used to evaluate performance. Arithmetic is the oldest and most elementary branch of mathematics; the name comes from the Greek word arithmos (number). Arithmetic is used by almost everyone, for tasks ranging from simple everyday counting to advanced scientific and business calculations. As a result, the need for a faster and more efficient arithmetic unit in computers has been a topic of interest for decades. The work presented here makes use of Vedic mathematics and proceeds step by step: first designing a Vedic multiplier, then a multiply-accumulate (MAC) unit, and finally an arithmetic module that uses this multiplier and MAC unit. The four basic operations of elementary arithmetic are addition, subtraction, multiplication, and division. Multiplication is the mathematical operation of scaling one number by another. In today's engineering world, multiplication-based operations are among the most frequently used functions, implemented in many digital signal processing (DSP) applications such as convolution, the Fast Fourier Transform, filtering, and the Arithmetic Logic Unit (ALU) of microprocessors. Since multiplication is such a frequently used operation, a multiplier must be fast and power-efficient, and the development of fast, low-power multipliers has accordingly been a subject of interest for decades.

Take the word length to be 16 bits. A single simple multiplier implementation then needs 16 rows of partial-product generation, with each row containing 16 partial-product bits. Accumulating these 16 partial-product rows requires substantial hardware to obtain the result in sum-and-carry form. Since a 16-point radix-2 FFT requires a large number of multiplications, performing all of them using Vedic mathematics reduces time, area, and power.

In most DSP algorithms, performance depends on the path delay of the multiplier. The speed of multiplication is very important in DSP as well as in general-purpose processors. In the early period, multiplications were generally implemented as a sequence of shift and add operations. Many algorithms for multiplication have been proposed in the literature, each providing various advantages and each with trade-offs in speed, circuit complexity, and area. Multiplication also dominates the execution time of most DSP applications, hence the need for a high-speed multiplier when designing an efficient ALU (Himanshu 2004). For this purpose, an ancient system of calculation rediscovered from the Vedas by Sri Bharati Krushna Tirthaji Maharaj, known as "Vedic Mathematics", is used. The peculiarity of Vedic mathematics lies in its simplicity and the flexibility it offers in carrying out calculations mentally (Jayaprakash et al. 2014). This gives us the liberty to choose the technique most suitable for a given problem. According to Tirthaji, all of Vedic mathematics is based on sixteen Sutras, which are actually
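The classic shift-and-add scheme mentioned above can be sketched as follows; each set bit of the multiplier contributes one shifted partial-product row:

```cpp
#include <cstdint>

// Shift-and-add multiplication of two 16-bit operands: scan the multiplier
// bit by bit; for each set bit, add the correspondingly shifted copy of the
// multiplicand (one partial-product row) to the accumulator.
uint32_t shiftAddMultiply(uint16_t a, uint16_t b) {
    uint32_t acc = 0;
    uint32_t pp  = a;                   // partial product, shifted each step
    for (int i = 0; i < 16; ++i) {
        if (b & (1u << i)) acc += pp;   // add this partial-product row
        pp <<= 1;
    }
    return acc;
}
```

This is exactly the 16-row accumulation whose hardware cost motivates the faster multiplier structures discussed in this section.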

[3] Himanshu Thapaliyal and M. B. Srinivas, "VLSI Implementation of RSA Encryption System Using Ancient Indian Vedic Mathematics", Center for VLSI and Embedded System Technologies, International Institute of Information Technology, Hyderabad, India.
[4] Charles H. Roth Jr., "Digital Systems Design Using VHDL", Thomson Brooks/Cole, 7th reprint, 2005.

Multipliers are important units in digital systems and in other applications related to digital processing [1]. Several researchers have tried to design multipliers that meet one of the usual constraints, low power consumption, low area utilization, or high speed, or a combination of them. The basic multiplication algorithm uses an add-and-shift methodology [2]. A variety of partial-product values are superimposed and added in parallel to enhance multiplier performance. To meet the speed constraint, the Baugh-Wooley multiplier algorithm is used.
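A behavioural sketch of the Baugh-Wooley scheme for 4-bit two's-complement operands, assuming the standard form in which the sign-weighted partial products are replaced by complemented terms plus two correction ones, so the whole array can be summed with ordinary unsigned adders (the word length N is a parameter of the sketch):

```cpp
#include <cstdint>

constexpr int N = 4;   // operand word length for this illustration

int16_t baughWooley(int8_t a, int8_t b) {
    auto bit = [](int v, int i) { return (v >> i) & 1; };
    uint32_t sum = 0;
    // Plain partial products: sign*sign at weight 2^(2N-2), and a_i*b_j
    // for i, j < N-1 at weight 2^(i+j).
    sum += static_cast<uint32_t>(bit(a, N - 1) & bit(b, N - 1)) << (2 * N - 2);
    for (int i = 0; i < N - 1; ++i)
        for (int j = 0; j < N - 1; ++j)
            sum += static_cast<uint32_t>(bit(a, i) & bit(b, j)) << (i + j);
    // Cross terms involving one sign bit enter complemented.
    for (int i = 0; i < N - 1; ++i) {
        sum += static_cast<uint32_t>(1 ^ (bit(a, i) & bit(b, N - 1))) << (i + N - 1);
        sum += static_cast<uint32_t>(1 ^ (bit(a, N - 1) & bit(b, i))) << (i + N - 1);
    }
    // Correction constants: a '1' at weight 2^N and at weight 2^(2N-1).
    sum += 1u << N;
    sum += 1u << (2 * N - 1);
    // Interpret the low 2N bits as a two's-complement product.
    sum &= (1u << (2 * N)) - 1;
    int16_t p = static_cast<int16_t>(sum);
    if (p & (1 << (2 * N - 1))) p -= (1 << (2 * N));
    return p;
}
```

The point of the rearrangement is architectural: every array cell is an AND (or NAND) feeding an unsigned adder tree, with no special handling of negative weights.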

This work demonstrates that the optimal quantized coefficients (in terms of mean square error) can be obtained as the solution of an integer quadratic programming problem. New fixed-width multiplier topologies, with different area-accuracy trade-offs, are then obtained by changing the quantization scheme. Two topologies are selected as the best ones. The first (named the "2-bit fixed-width multiplier") is based on a uniform coefficient quantization with two bits. The second topology (the "1.5-bit fixed-width multiplier") is based on a non-uniform quantization in which a portion of the coefficients (around one half of the total) are quantized with two bits, while the remaining coefficients are quantized with a single bit. The proposed fixed-width multiplier topologies exhibit better accuracy than previous solutions, close to the theoretical lower bound.

ABSTRACT: A multiplier is one of the key hardware blocks in most digital signal processing (DSP) systems. Typical DSP applications in which a multiplier plays an important role include digital filtering, digital communications, and spectral analysis. Many current DSP applications target portable, battery-operated systems, so power dissipation becomes one of the primary design constraints. Since multipliers are rather complex circuits and must typically operate at a high system clock rate, reducing the delay of a multiplier is an essential part of satisfying the overall design. This paper puts forward a high-speed multiplier, efficient in terms of speed, making use of Urdhva Tiryagbhyam [1], a sutra from Vedic mathematics, for multiplication, and half adders for the addition of partial products. The code is written in VHDL, and the results show that a multiplier implemented using Vedic multiplication is more efficient in area and speed than implementations using Array and Booth multiplier architectures.

fast but expensive in terms of area. The digit-serial architecture is flexible, with moderate speed and a reasonable implementation cost. Two low-energy digit-serial PB multipliers have been proposed in which a binary tree of XOR gates is used instead of a linear array of XOR gates for degree reduction, reducing both power consumption and delay. Various digit-serial multipliers have been proposed, such as most-significant-digit-first and least-significant-digit-first designs with architectural modifications. A factoring technique is involved in the design of a digit-serial PB multiplier in GF(2^m).
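A behavioural sketch of a digit-serial polynomial-basis multiplier, assuming GF(2^8) with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and a digit size of two (the designs cited above use other fields and digit sizes):

```cpp
#include <cstdint>

constexpr int D = 2;   // digit size: D = 1 is bit-serial, D = 8 fully parallel

// Multiply by x in GF(2^8) and reduce modulo 0x11B (the AES polynomial).
uint8_t xtime(uint8_t v) {
    return static_cast<uint8_t>((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

// Digit-serial polynomial-basis multiplication: one digit (D bits) of the
// operand b is consumed per iteration, accumulating shifted-and-reduced
// copies of a with XOR (carry-free addition in GF(2^m)).
uint8_t gfMultDigitSerial(uint8_t a, uint8_t b) {
    uint8_t acc = 0;
    for (int d = 0; d < 8; d += D) {        // one digit per "clock"
        for (int i = 0; i < D; ++i) {       // bits within the digit
            if (b & (1u << (d + i))) acc ^= a;
            a = xtime(a);                   // shift a by x and reduce
        }
    }
    return acc;
}
```

The digit size D sets the speed/area trade-off the excerpt describes: larger digits mean fewer iterations but a wider combinational block per iteration.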

Power dissipation in CMOS circuits has three main sources: 1) the charging and discharging of capacitive loads due to changes in input logic levels; 2) the short-circuit current that arises from the direct current path between the supply rails during output transitions; and 3) the leakage current, which is determined by the fabrication technology and consists of the reverse-bias current in the parasitic diodes formed between the source/drain diffusions and the bulk region of a transistor, as well as the subthreshold current that arises from the inversion charge present at gate voltages below the threshold voltage. The short-circuit and leakage currents in CMOS circuits can be made small with proper device- and circuit-design techniques. The dominant source of power consumption is the charging and discharging of the node capacitances, which can be minimized by reducing the switching activity of the transistors. The switching activity of a digital circuit is also a function of the logic style used to implement it.
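The dominant switching term discussed above is commonly modelled as P_dyn = alpha * C_load * Vdd^2 * f. A trivial numeric sketch, with the example values below chosen arbitrarily for illustration:

```cpp
// Dynamic (switching) power model: activity factor alpha, switched
// capacitance cLoad (farads), supply voltage vdd (volts), clock frequency
// freq (hertz). Returns power in watts.
double dynamicPower(double alpha, double cLoad, double vdd, double freq) {
    return alpha * cLoad * vdd * vdd * freq;
}
```

The quadratic dependence on Vdd is why voltage scaling dominates low-power design, while coefficient-ordering and similar techniques attack the alpha term.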

In the proposed architecture, partial-product reduction is accomplished by the use of 4:2 and 5:2 compressor structures, and the final addition stage is performed by a Sklansky adder. This multiplier architecture comprises a partial-product generation stage, a partial-product reduction stage, and the final addition stage. The latency of the Wallace tree multiplier can be reduced by decreasing the number of adders in the partial-product reduction stage. In the proposed architecture, multi-bit compressors are used to reduce the number of partial-product addition stages. The combined factors of low power, low
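A behavioural sketch of a 4:2 compressor built from two cascaded full adders; it preserves the arithmetic invariant x1 + x2 + x3 + x4 + cin = sum + 2*(carry + cout), which is what lets columns of compressors replace deeper adder trees:

```cpp
// 4:2 compressor from two full adders: four partial-product bits plus a
// carry-in produce a sum bit and two carry bits of weight 2.
struct Comp42 { int sum, carry, cout; };

Comp42 compress42(int x1, int x2, int x3, int x4, int cin) {
    // First full adder compresses x1..x3.
    int s1 = x1 ^ x2 ^ x3;
    int c1 = (x1 & x2) | (x2 & x3) | (x1 & x3);       // becomes cout
    // Second full adder folds in x4 and the carry-in.
    int sum   = s1 ^ x4 ^ cin;
    int carry = (s1 & x4) | (x4 & cin) | (s1 & cin);
    return {sum, carry, c1};
}
```

Crucially, cout depends only on x1..x3 and not on cin, so a row of 4:2 compressors has no horizontal carry ripple.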


Most computer arithmetic applications are implemented using digital logic circuits and thus operate with high reliability and precision. However, many applications, such as multimedia and image processing, can tolerate errors and imprecision in computation and still produce meaningful and useful results. The paradigm of inexact computation relies on relaxing fully precise and completely deterministic building modules when designing energy-efficient systems. In digital design, integer multiplication is one of the fundamental building blocks affecting microprocessor and DSP performance. Fast non-Booth multipliers mostly use well-known schemes such as Wallace, Dadda, or the Three-Dimensional Method (TDM). These are all based on a carry-save compression tree, which uses full adders and half adders to turn a multi-operand sum into a two-operand addition, which is then realized by a final carry-propagate adder. In this paper we consider an approximate multiplier that uses a speculative functional unit.
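The carry-save compression underlying the Wallace, Dadda, and TDM schemes can be sketched as a 3:2 reduction: a row of full adders turns three operands into a sum word and a shifted carry word without any carry propagation, and only the final two-operand addition propagates carries:

```cpp
#include <cstdint>

// Carry-save addition (3:2 reduction): bitwise XOR gives the sum word,
// the majority function gives the carries, shifted left by one weight.
// Operands are assumed small enough that the shifted carry does not
// overflow 32 bits.
struct CarrySave { uint32_t sum, carry; };

CarrySave csa(uint32_t a, uint32_t b, uint32_t c) {
    return { a ^ b ^ c,
             ((a & b) | (b & c) | (a & c)) << 1 };
}

// Three-operand add: one carry-free 3:2 step, then a single
// carry-propagate addition at the end.
uint32_t addThree(uint32_t a, uint32_t b, uint32_t c) {
    CarrySave cs = csa(a, b, c);
    return cs.sum + cs.carry;
}
```

A full compression tree simply repeats this step until only two operands remain; speculative or approximate variants cut corners in the final carry-propagate adder.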
