All these major shifts in the hardware landscape are allowing humankind to advance scientific computing, and they are enabling us to tackle ever more important problems with ever-larger volumes of data. At the same time, these shifts demand radical changes on the software side as well. For instance, to improve software performance at lower clock frequencies, traditionally sequential code had to be adapted to take advantage of multiple cores. To harness accelerators and their forms of parallelism, new programming models and algorithms have been developed. To take advantage of new non-volatile memory, a new protocol and changes in user and kernel code have been carried out. Therefore, while hardware heterogeneity has become key to designing more efficient systems, it requires tighter integration between hardware and software. This integration is what we call hardware/software co-design. From a practical standpoint, hardware/software co-design is the cooperative design of hardware and software aimed at meeting certain requirements by exploiting possible synergies between the two. A non-exhaustive list of possible requirements includes: shortening the execution time, reducing the energy-to-solution, or staying within a set power consumption limit. While the first requirement is the most common, applications executed at large scale might be more interested in reducing the total energy consumed, even if this implies longer execution times.
Hardware and software reuse was also examined at all levels of production. As the level of reuse increases, cost decreases because both design costs and software development costs decrease. At low levels of production, reuse affects cost more strongly: design cost and software development cost are both non-recurring and have fewer units to be amortized over, so a change in those cost values has a greater effect on total cost than at larger production rates. The effect of reuse at a medium level of production is shown in Figure 6. We can observe, for instance, that although the MIXED implementation may not be less expensive than the ASIC implementation with no reuse, if the JPEG encoder producer could increase levels of reuse to 40%, the MIXED implementation at 40% reuse could be less expensive than the ASIC implementation with no reuse. However, reuse of IP and software is effectively coupled in this approximation (which would not be the case in reality) so that 20% reuse in the ASIC
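The amortization argument above can be made concrete with a toy per-unit cost model. All figures below are hypothetical illustrations, not values taken from the study:

```python
def unit_cost(nre_design, nre_software, recurring, units, reuse=0.0):
    """Per-unit cost: the non-recurring design and software costs are reduced
    by the reuse fraction and amortized over the production volume."""
    nre = (nre_design + nre_software) * (1.0 - reuse)
    return nre / units + recurring

# At low volumes, a change in reuse moves per-unit cost far more than at
# high volumes, because there are fewer units to amortize the NRE over.
low_vol_no_reuse = unit_cost(2e6, 1e6, 10.0, units=1_000)
low_vol_reuse = unit_cost(2e6, 1e6, 10.0, units=1_000, reuse=0.4)
high_vol_reuse = unit_cost(2e6, 1e6, 10.0, units=1_000_000, reuse=0.4)
```

With these illustrative numbers, 40% reuse saves 1200 per unit at 1,000 units but only a fraction of a unit of cost at 1,000,000 units, mirroring the observation about medium and large production rates.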
is the partitioning. This is not just partitioning between hardware and software: it also takes reprogrammable hardware into account, letting designers analyse the additional trade-offs that this kind of design implies. Configurable hardware has both advantages and disadvantages: it enables some flexibility and the use of more complex scheduling algorithms, but it increases the cost of debugging the hardware. Therefore, many developers of ESs choose to run their applications on top of either an embedded operating system or a real-time executive.
to implementation. The Electronic Design Automation tools for this kind of work are just becoming available, together with very large Field Programmable Gate Arrays, such as Xilinx Virtex™, which will enable students to gain hands-on experience of hardware/software co-design relatively inexpensively. Probably, in the near future, the disciplines of hardware engineering and software engineering will disappear, in which case there will only be systems engineers who use the system-on-chip approach. System-on-chip brings a fresh challenge to higher education in computer science and electronics. It is unlikely that students would have sufficient background in electronics or software to tackle a system-on-chip design until postgraduate level. Even at postgraduate level, a double-length project will be required in order to develop and implement a system-on-chip design. Group projects are probably the only way that a system-on-chip design could be brought to fruition in the limited time available to students, and this is exactly what is being proposed for the course at Luton. The electronic design automation tools for system-on-chip design are very complex and will need some time for familiarisation. Therefore a careful balance has to be struck between the theoretical and hands-on content of the course.
In these experiments, we compared three approaches: the results from a software-only single ARM Cortex-A9 processor, the dual ARM Cortex-A9 processor approach and, finally, the FPGA co-design approach. In Figure 4, the ARM15 curves indicate the experiment using dual ARM Cortex-A9 processors, the Dcore curves indicate the experiments using a single ARM Cortex core, and the Codcore curves indicate the hardware/software co-design approach for pedestrian tracking. Overall, the best result came from the gym sequence (Figure 4 (C)) in all three experiments, since the pedestrian objects in the blur body, dance2 and gym sequences are 10% smaller than the detection window (64 pixels by 128 pixels). It is noted that the results for the gym and dance2 sequences have the same increasing rates on all three platforms. The precision rates of the FPGA co-design platform (legend Codcore15 in Figure 4) are better than those of the single ARM Cortex-A9 system (legend ARM15 in Figure 4) and the dual ARM Cortex-A9 processors (legend Dcore15 in Figure 4), thanks to the FPGA hardware acceleration advantage. The resultant success plots for the three scenes are shown in Figure 4 (D-F), which show that our implementation is adaptive in both scale and aspect ratio.
With increasing device densities, previously difficult challenges become feasible and the integration of embedded SoPC (System on Programmable Chip) systems is significantly improved. Reconfigurable systems on a chip have become a reality with softcore processors. A softcore processor is a microprocessor fully described in software, usually as a VHDL description, that can be synthesized onto programmable hardware such as an FPGA. Softcore processors can be easily customized to the needs of a specific target application. The two major FPGA manufacturers provide commercial softcore processors: Xilinx offers the MicroBlaze processor, while Altera offers the Nios II processor. The benefit of a softcore processor is the addition of micro-programmed logic that introduces more flexibility. A hardware/software co-design approach is then possible: a particular functionality can be developed in software for flexibility and upgradability, complemented with custom hardware blocks for cost reduction and performance.
Abstract- The implementation of a hardware/software co-design in the field of control systems is explained in this paper. The architecture of such systems is a key aspect in the design process. The system is designed around the ATmega328P microcontroller. The functionality of such a system is shown using a case study: speed control of a DC motor under loaded and unloaded conditions. Using minimal hardware and maximal software reduces the hardware, lessens the expenditure of the system, and reduces the control system's dependency on the hardware. This makes the system design simple and compact. The purpose of speed control is to reduce the error and drive the motor at a set speed, under loaded or unloaded conditions, to a certain extent. The speed control of a DC motor is crucial in applications where precision is of the essence.
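The minimal-hardware/maximal-software idea places the control law itself in code. A minimal sketch of such a software speed loop, here a PI controller driving a crude first-order motor model; the gains, PWM range, and motor constants are our illustrative assumptions, not the paper's values:

```python
def pi_speed_loop(setpoint_rpm, steps=1000, dt=0.01, kp=0.002, ki=0.02):
    """PI speed control of a toy first-order DC-motor model (illustrative).
    The PWM duty cycle is clamped to [0, 1]; the integrator only runs when
    the output is unsaturated (simple anti-windup)."""
    speed, integral = 0.0, 0.0
    for _ in range(steps):
        error = setpoint_rpm - speed
        unsat = kp * error + ki * integral
        duty = max(0.0, min(1.0, unsat))   # PWM duty cycle sent to the driver
        if duty == unsat:                  # anti-windup: freeze when saturated
            integral += error * dt
        # toy plant: speed relaxes toward duty * 3000 rpm, 0.5 s time constant
        speed += dt * (duty * 3000.0 - speed) / 0.5
    return speed

final_speed = pi_speed_loop(1500.0)  # settles near the 1500 rpm setpoint
```

The integral term removes the steady-state error that a proportional-only loop would leave, which is the "reduce the error and drive the motor at a set speed" behaviour described above.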
Abstract-This paper presents a hardware/software (HW/SW) co-design approach using the SOPC technique and a pipeline design method to improve the performance of particle swarm optimization (PSO) for embedded applications. Based on a modular design architecture, a particle updating accelerator module, implemented in hardware for updating the velocity and position of particles, and a fitness evaluation module, implemented on a soft-core processor for evaluating the objective functions, are respectively designed and work closely together to accelerate the evolution process. Thanks to the flexible design, the proposed approach can tackle various optimization problems of embedded applications without the need for hardware redesign. To compensate for the deficiency in generating truly random numbers in hardware, a particle re-initialization scheme is also presented in this paper to further improve the execution performance of the PSO. Experimental results demonstrate that the proposed HW/SW co-design approach to realizing PSO is capable of achieving high-quality solutions effectively.
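A software sketch of the two cooperating pieces, the particle velocity/position update (the part the accelerator module implements in hardware) and the fitness evaluation (the part the soft-core processor runs), plus a simple periodic re-initialization; all constants here are conventional PSO defaults, not the paper's parameters:

```python
import random

def pso_minimize(fitness, dim=2, n_particles=20, iters=200, w=0.7,
                 c1=1.5, c2=1.5, lo=-5.0, hi=5.0, reinit_every=50):
    """Minimal PSO sketch: the inner update kernel is what a hardware
    accelerator would pipeline; fitness() stays on the soft-core CPU."""
    rng = random.Random(0)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(p) for p in pos]         # fitness: software side
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for t in range(1, iters + 1):
        for i in range(n_particles):
            for d in range(dim):                # update kernel: hardware side
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
        if t % reinit_every == 0:               # re-initialization (sketch):
            for i in range(n_particles):        # scatter particles to restore
                pos[i] = [rng.uniform(lo, hi) for _ in range(dim)]
                vel[i] = [0.0] * dim            # diversity, keep pbest/gbest
    return gbest, gbest_f

best_pos, best_val = pso_minimize(lambda p: sum(x * x for x in p))  # sphere
```

The periodic scatter is one simple way to counter the limited randomness of hardware number generation mentioned above; the paper's actual scheme may differ.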
The SMU Co-Design Project is an effort to target the problem of hardware/software co-design via an open source laboratory for studying hardware-software integration. The project focuses on the use of Model Driven Architectures (MDA) to define high-level model-based system descriptions that can be implemented in either hardware or software. Utilizing component and state diagrams based on the Unified Modeling Language (UML), we demonstrate MODCO, a transformation tool that takes a UML state diagram as input and generates HDL output suitable for use in Field Programmable Gate Array (FPGA) circuit design. With this tool as a first step, we plan to continue to bridge the gap between hardware and software design, taking advantage of trends in both areas to work with higher-level description languages and using software transformation tools to manage lower-level hardware or software implementation details. This project also serves as the basis for a new generation of software and computer engineers who understand each other's problems and issues and are able to leverage the capabilities of model-based system description languages.
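MODCO's actual transformation is far more complete; the flavor of mapping a state diagram to HDL can nonetheless be hinted at with a toy generator. The signal names and the VHDL fragment shape below are our own illustrative assumptions, not MODCO's output format:

```python
def fsm_to_vhdl(states, transitions):
    """Toy state-diagram-to-HDL mapping (not MODCO itself): emit a VHDL
    state type and a next-state case statement from (src, condition, dst)
    transition triples."""
    out = [f"type state_t is ({', '.join(states)});",
           "case current_state is"]
    for src in states:
        out.append(f"  when {src} =>")
        for s, cond, dst in transitions:
            if s == src:
                out.append(f"    if {cond} = '1' then next_state <= {dst}; end if;")
    out.append("end case;")
    return "\n".join(out)

vhdl = fsm_to_vhdl(["idle", "busy"],
                   [("idle", "start", "busy"), ("busy", "done", "idle")])
```

Even this toy shows the appeal of the approach: the designer edits the diagram-level transition list, and the repetitive HDL scaffolding is generated mechanically.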
This chapter explains the design and implementation. Section 4.1 describes the implementation environment, especially the FPGA and camera module. The ZedBoard uses the AXI bus as the on-chip interconnect protocol between the software core and the programmable logic, and this chapter also explains the details of the AXI protocol. Section 4.2 reviews the system algorithm and discusses its optimization for the purpose of hardware-software co-design. Section 4.3 gives the result of system profiling and examines which part of the program consumes most of the execution time. Section 4.4 presents the hardware-software co-design based on the algorithm optimization and the profiling results. Section 4.5 discusses the data size of the HOG feature values and examines a data reduction method. Finally, Section 4.6 explains the details of the architecture.
A hardware/software co-design architecture for two typical MIMO lattice decoding algorithms has been designed and implemented in this thesis. The closest-lattice-point searching procedure is partitioned into FPGA-based hardware modules, while a MicroBlaze soft core is used for the channel matrix preprocessing and R/I decomposition. Three levels of parallel structures are designed in this co-design architecture to improve the decoding rate, and the overheads involved in these parallel structures are also analyzed. The proposed architecture is prototyped on the Xilinx XUP Virtex-II Pro development board with an XC2VP30 FPGA. The experimental results show that the AV- and VB-based decoders can reach up to 81.5 Mbps and 37.3 Mbps decoding rates respectively at 20 dB Eb/N0 for a 4 × 4 MIMO system with 16-QAM modulation, which are among the fastest MIMO decoders to the author's knowledge. They are about 37 and 187 times faster than their respective implementations on a DSP processor. The BER performance of the experimental prototype matches the software simulation results.
Figure 6 shows the data flow of the 3D graphics test benches. These test benches generate the 3D vertex data and configure the GE/RE using the device driver at step 5. After that, the GE and RE perform numerous R/W operations on different blocks in main memory, such as the 3D vertex buffer, 2D vertex buffer, Z buffer, and 32-bit frame buffer. These memory blocks are reserved for the 3D graphics SoC, so no other component will access them. The authors modify the SCI and QEMU interface to keep these operations within the SystemC side. This modification does not affect the hardware and software design. This communication depends on the data flow of the target application, so it is called application-specific communication. Figure 17 shows the difference in data flow between the general and the application-specific communication. For instance, when the GE writes data into synchronous dynamic random
of performance information available. However, the Tensilica processors provide a continuous stream of debug information fully describing the processor's activity. This data can be translated into performance statistics that can be collected, enabling the performance profiling necessary for effective co-tuning. Dynamic reconfiguration and detailed performance information enable our FPGA-based emulation platform to achieve flexibility and state visibility similar to those found in software-based methods. The value in the credibility of an FPGA-based emulation goes beyond providing accurate performance projections. From a practical standpoint, any co-design process must involve both hardware vendors and scientists, who will likely be from differing labs and industry partners. IP issues can become a barrier to innovation, as hardware vendors are often reluctant to share low-level details of future designs. To further enhance vendor interaction, the flexible nature of the tools used to rapidly prototype designs can be extended to provide a proxy architecture for vendor-specific technologies. The availability of a highly credible proxy architecture model is a powerful tool to influence industry, as the hardware designers and architects are not constrained by products that must fit into a vendor's existing product roadmap.
Abstract – In this paper, the design and implementation results of a system-on-a-chip (SOC) based speech recognition system, realized as a software/hardware co-design, are presented. The hidden Markov model (HMM) is used for the speech recognition. In order to implement this in an SOC, the various tasks required are optimally partitioned between hardware and software. The SOC, housed in Altera FPGA boards, has a Nios II soft core processor. Custom hardware blocks are developed for computationally intensive blocks such as the Viterbi decoder. The preprocessing and training of the HMM are implemented in software (using a C program). The Viterbi decoding is implemented in hardware as a custom block for real-time recognition; it is also implemented in software for verification and comparison. It is observed that the sequential hardware implementation of the Viterbi block is 80 times faster than the software approach using a C program with the UP3 kit. An overall recognition accuracy of 94.8% is achieved for speaker-independent digit recognition on our own database of 6 speakers. Altera's DE2 board with a Cyclone II FPGA is used to implement TI46 digit recognition. Since the DE2 board has more logic elements than the UP3 kit, the Viterbi decoding is implemented in parallel for the digits 0-9. Because of this, recognition is 772 times faster than the software implementation with the Cyclone II FPGA. It is also observed that, for the TI-46 speech database and speaker f1, the recognition accuracy is 87% using LPC as the feature extraction technique. Extension of this work to a larger vocabulary size and to MFCC as the feature extraction technique is in progress.
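The Viterbi kernel moved into hardware is, in essence, the standard dynamic-programming recursion. A software reference version follows, using log probabilities to avoid underflow; the two-state toy model in the usage example is illustrative and not the paper's digit HMMs:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Reference Viterbi decoding: the computationally intensive kernel that
    a custom hardware block would accelerate. Returns the most likely state
    path for the observation sequence."""
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[t - 1][p] + log_trans[p][s])
            v[t][s] = v[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# toy two-state model (illustrative numbers only)
L = math.log
states = ("sil", "sp")
path = viterbi(
    ["low", "high", "high"], states,
    log_start={"sil": L(0.8), "sp": L(0.2)},
    log_trans={"sil": {"sil": L(0.7), "sp": L(0.3)},
               "sp": {"sil": L(0.3), "sp": L(0.7)}},
    log_emit={"sil": {"low": L(0.9), "high": L(0.1)},
              "sp": {"low": L(0.2), "high": L(0.8)}})
```

Because each time step's max/add recurrence is independent per state, the same structure parallelizes naturally, which is what the per-digit parallel hardware implementation on the DE2 board exploits.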
I. INTRODUCTION
Wireless telecommunications and multimedia kernels often exhibit amounts of instruction- and data-level parallelism that can far outstrip the computational ability of high-performance software-programmable VLIW-style DSP cores. As such, portions of the application in question are often accelerated in hardware to achieve the required performance. The cost of doing this is the loss of software programmability and the flexibility of a software implementation. At the same time, many companies view their software as intellectual property and have a strong desire for a software-based solution that is a proprietary implementation. Examples of this are channel equalization kernels for baseband processing in wireless infrastructure, where vendors often maintain their own proprietary version implemented in highly optimized software on one or more programmable DSP cores. In deciding to offload functionality to hardware acceleration, there are a number of design decisions which must be considered, many of which are sacrificed in one implementation versus another. This paper presents application-specific, compiler-driven architecture design space exploration for channel estimation in mobile wireless receivers. It compares the tradeoffs between compiler-driven software-programmable DSP implementations and hardware-based accelerator implementations. The studies are based around the Texas Instruments TMS320C64x DSP architecture. For each workload, the compiler can de-
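The DSP-versus-accelerator tradeoff is typically evaluated on a kernel such as channel estimation from known pilot symbols. A generic least-squares sketch follows, solved via the normal equations with plain Gaussian elimination; it is a textbook formulation for illustration, not TI's or any vendor's proprietary kernel:

```python
def ls_channel_estimate(pilots, received, taps=3):
    """Generic least-squares estimate of a FIR channel impulse response from
    known pilot symbols: minimize ||received - convolve(pilots, h)||."""
    n = len(received)
    # convolution matrix A with A[i][j] = pilots[i - j]
    A = [[pilots[i - j] if 0 <= i - j < len(pilots) else 0.0
          for j in range(taps)] for i in range(n)]
    # normal equations (A^T A) h = A^T y
    ata = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(taps)]
           for i in range(taps)]
    aty = [sum(A[k][i] * received[k] for k in range(n)) for i in range(taps)]
    for col in range(taps):          # Gauss-Jordan elimination (A^T A is SPD,
        piv = ata[col][col]          # so no pivoting is needed here)
        for j in range(col, taps):
            ata[col][j] /= piv
        aty[col] /= piv
        for r in range(taps):
            if r != col and ata[r][col] != 0.0:
                f = ata[r][col]
                for j in range(col, taps):
                    ata[r][j] -= f * ata[col][j]
                aty[r] -= f * aty[col]
    return aty

# noiseless usage example with a hypothetical 3-tap channel
true_h = [0.9, 0.3, -0.2]
pilots = [1.0, -1.0, 1.0, 1.0, -1.0]
rx = [sum(true_h[j] * (pilots[i - j] if 0 <= i - j < len(pilots) else 0.0)
          for j in range(3)) for i in range(len(pilots) + 2)]
h_hat = ls_channel_estimate(pilots, rx)
```

The kernel is dominated by multiply-accumulate loops, which is exactly why it sits on the boundary between an optimized VLIW software mapping and a dedicated accelerator.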
The Nios II is a soft core processor based on a 32-bit RISC architecture from Altera, for use in their FPGAs. The system designer can define a custom Nios II core based on the requirements of the application to be developed. Altera's Quartus II and the Nios II Integrated Development Environment (IDE) are the design tools used for building, debugging and running a Nios II system. SOPC Builder is a powerful system development tool, included in Quartus II, for creating systems including processors, peripherals, and memories. It enables us to define and generate a complete system-on-a-programmable-chip (SOPC) in much less time than traditional, manual integration methods. For AES as a software component, its complete functionality is described in a higher-level language, i.e., the C language. The SOPC system for AES as software is depicted in Fig 3.2. The Nios II processor executes the C code through the Nios II IDE, which is a software development tool. The performance of the system in terms of speed is measured by a performance counter core; this core can accurately measure the execution time taken by multiple sections/modules of code. Fig 3.3 depicts the SOPC system for AES realized in hardware-software.
lower error energy in comparison to the existing algorithms for DCT approximation. The decomposition process enables generalization of the proposed transform to larger-size DCTs. Interestingly, the proposed algorithm is scalable for hardware as well as software implementation of DCTs of greater lengths, and it can make use of the best of the existing approximations of the 8-point DCT. Based on the proposed algorithm, we have designed a fully scalable, reconfigurable, and parallel architecture for approximate DCT computation. One distinctively interesting feature of the proposed design is that the structure for the computation of the 32-point DCT can be configured for parallel computation of two 16-point DCTs or four 8-point DCTs.
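For reference, the exact transform being approximated is the DCT-II. A naive software sketch of it, together with the reconfiguration idea of computing four independent 8-point transforms in place of one 32-point transform; the sample data and block partitioning are illustrative, not the paper's architecture:

```python
import math

def dct_ii(x):
    """Naive (exact, unscaled) DCT-II. Approximate-DCT designs replace these
    cosine multiplications with cheap addition-based arithmetic while keeping
    the same overall structure."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

# the reconfigurable structure can compute one 32-point DCT...
samples = [float(i % 7) for i in range(32)]
one_32pt = dct_ii(samples)
# ...or, reconfigured, four independent 8-point DCTs over the same input
four_8pt = [dct_ii(samples[i:i + 8]) for i in range(0, 32, 8)]
```

The same datapath being usable at several transform lengths is what "fully scalable and reconfigurable" refers to above: the arithmetic units are shared, and only the routing between them changes.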
Skeleton Driven Simulation
It is common to use simulation to predict the performance of systems before they are available. Simulation of a full parallel application would be prohibitively expensive, given that most production applications take a significant amount of time to execute on the bare system itself. The efficient utilization of large-scale parallel event simulators such as SST/macro requires that skeleton models of the underlying software systems and architectures be created. Implementing such models by abstracting the designs of large-scale parallel applications requires a substantial amount of manual effort and introduces human errors. Our approach reduces both the effort and the likelihood of errors in the skeleton by using established algorithms for program dependency analysis and code generation. These skeleton models can then be combined with appropriate models of the software stack and system hardware to generate a wealth of information about execution patterns, such as application communication characteristics and network utilization, for high-performance computing architectures. This information is useful in understanding the impact of various design decisions concerning hardware or software, and it will enable co-design practices to be applied to the design of future exascale systems and provide an environment to prototype ideas for future programming models and software infrastructure for these machines.
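As a flavor of what a skeleton model captures, consider a 1-D halo exchange whose computation is elided and only the communication pattern is kept. This hand-written toy is illustrative (the approach described above generates such skeletons automatically from the application source):

```python
def halo_exchange_skeleton(ranks, steps, ghost_bytes=8192):
    """Skeleton of a 1-D periodic halo exchange: record (step, src, dst, bytes)
    for every message while eliding the compute phase entirely. A simulator
    such as SST/macro consumes this kind of model to estimate communication
    characteristics and network utilization (hypothetical message sizes)."""
    trace = []
    for t in range(steps):
        for r in range(ranks):
            left, right = (r - 1) % ranks, (r + 1) % ranks
            trace.append((t, r, left, ghost_bytes))   # send ghost cells left
            trace.append((t, r, right, ghost_bytes))  # send ghost cells right
    return trace

trace = halo_exchange_skeleton(ranks=4, steps=2)
total_bytes = sum(msg[3] for msg in trace)  # traffic injected per simulated run
```

The skeleton is orders of magnitude cheaper to run than the real application yet preserves exactly the information the network model needs: who communicates with whom, when, and how much.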
Compared with traditional control system designs, the hardware and software co-design implementation approach enjoys the following advantages: (1) Because the FPGA chip is used to process the measured signal from a sensor and generate the PWM signal, the microcontroller is released from handling unexpected interrupts due to the CCP module capturing signals and calculating PWM signals. Instead, it can focus on complicated algorithms for signal processing and control signal synthesis, and the resultant software management and timing programming are easy to deal with. (2) The approach helps save some resources of the microcontroller. For example, a multi-channel encoder may require multi-channel capturers or accumulators in a microcontroller to capture the pulse inputs, and microcontrollers with more resources are more expensive. In our co-design implementation, only the parallel ports of the microcontroller were employed to transfer data between the microcontroller and the FPGA chip; most required functions are implemented in the FPGA chip via a hardware approach. Furthermore, in terms of hardware design, the hardware description language and the FPGA chips provide convenient tools to construct extra desirable features for a project, for example, creating multi-channel input signal capturers or multi-channel PWM signal generators. (3) Programming with a microcontroller frees one from trivial, low-level floating-point computation on an FPGA chip, which has been proved to be less efficient and more time-consuming. In particular, for control algorithms that depend heavily on floating-point arithmetic to guarantee accuracy or that require serial signal processing, using a microprocessor as an IP core in the system is a reasonable tradeoff between design time and design cost.
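Advantage (2) can be pictured with a behavioural model of the split: the FPGA fabric accumulates encoder edges, and the microcontroller only reads the latched count over the parallel port and performs the speed arithmetic. The pulse counts and scaling factors below are illustrative assumptions, not values from the paper:

```python
class PulseAccumulator:
    """Behavioural model of the FPGA-side accumulator: it counts encoder
    edges so the microcontroller needs no capture/accumulator peripherals."""
    def __init__(self):
        self.count = 0

    def edge(self, forward):
        """Called once per encoder edge in the FPGA fabric."""
        self.count += 1 if forward else -1

    def read_and_clear(self):
        """Models the parallel-port transfer: latch, return, and reset."""
        c, self.count = self.count, 0
        return c

def speed_rpm(counts, pulses_per_rev=1024, sample_s=0.01):
    """Microcontroller-side arithmetic: edge count per sample window -> rpm."""
    return counts / pulses_per_rev / sample_s * 60.0

fpga = PulseAccumulator()
for _ in range(512):              # 512 forward edges within one 10 ms window
    fpga.edge(True)
rpm = speed_rpm(fpga.read_and_clear())
```

The division of labour mirrors the text: high-rate edge counting stays in hardware, while the floating-point conversion to engineering units stays in software on the microcontroller.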
The concept of GPIO (General Purpose I/O Port) is introduced, through which any custom hardware, i.e., one's own designed hardware or an IP core, can be interfaced with the open source processor. The AES encryption algorithm is selected as the IP core to be interfaced with the LEON3 processor. AES is implemented in VHDL, while the control of the algorithm is in software: the algorithm is partitioned so that the complete algorithm runs in hardware and its control runs in software. The hardware part of the algorithm is interfaced with the processor-based system as custom hardware, and its performance parameters are studied; AES is thus implemented using a co-design approach. AES is the latest encryption standard used to protect confidential information, such as financial data, for government and commercial use. The LEON3 is a synthesizable VHDL model of a 32-bit processor available under the GNU GPL license. The design is implemented on a Cyclone II FPGA from Altera Corporation.