A.1 used OCSN frame types
10.3 Example Microcontroller Implementation for MRP
Showing that the MRP can support complex digital components is an important part of the framework evaluation. Therefore, a small CPU has been ported to run as a distributed core on the MRP. The processor core used was developed for teaching purposes by the Computer Engineering group of the Helmut Schmidt University in Hamburg. It supports 16 32bit registers, a 32bit ISA, a 32bit databus, and a 16bit address bus. A simple assembler is available for easier software development.
To port the processor core onto the MRP, it has to be divided into its core parts: fetch and decode unit, control unit, register file, and ALU. These components have to be encapsulated into the CEB signal interface. The fetch and decode unit has to be divided into two units. One unit is responsible for fetching datawords from a RAM component within the OCSN using the CSN2OCSN bridge. The second one decodes the fetched words for the datapath of the processor core. The control unit was extended by two states in its FSM to use the additional fetch stage, which the OCSN access makes necessary.
The fetch unit is accessible from the OCSN to select the address of the OCSN RAM component and its port. Additional command frames are available to start, stop, and reset the processor core. This is necessary because programs running on the MRP's host system manage the processor core and its software. Figure 10.4 presents the MRP configuration for the processor core. All components, except the ALU, fit into the CEBs of CSN switch 0. The ALU is configured into CEB 1 of switch 1. Without the MRP and configured as a SoC on a Xilinx Virtex5 FPGA the processor core can run at 30MHz. Hence, 25MHz is the maximum frequency of the core on the MRP.
[Figure 10.4: MRP CPU Configuration. Blocks shown: Fetch, Decode, Control, RegFile, ALU, CSN SW 0 to CSN SW 3, CSN2OCSN, CSN2OCSNsimple. Colour key: yellow CSN Switch 0, red CSN Switch 1, green CSN Switch 2, purple CSN Switch 3.]
Using the propagation delay matrix in Table 10.3 one can look up the maximum path delay between all components. The ALU is connected to the control unit, the decode unit and the register file. The maximum propagation delay between these components is 10.62ns. We have to take into account that the ALU is a combinational circuit, so the path delay enters the calculation twice. The maximum possible clock frequency is therefore 1/(2 × 10.62ns) ≈ 47MHz, but the processor core itself cannot run at this speed.
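The frequency bound above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `max_clock_mhz` and its parameters are hypothetical, and the factor of two for paths that contain a combinational component is read off the formula in the text rather than taken from the MRP sources.

```python
def max_clock_mhz(path_delay_ns: float, combinational: bool) -> float:
    """Upper bound on the clock frequency for one CEB-to-CEB path.

    Assumption (from the formula in the text): a path that includes a
    combinational component is budgeted at twice its propagation delay.
    """
    period_ns = path_delay_ns * (2 if combinational else 1)
    return 1000.0 / period_ns  # 1/ns is GHz, so scale by 1000 to get MHz


# Worst-case ALU path from Table 10.3: 10.62 ns, ALU is combinational.
alu_path_mhz = max_clock_mhz(10.62, combinational=True)
print(f"{alu_path_mhz:.1f} MHz")
```

The same helper, called with `combinational=False`, gives the bound for purely sequential connections, which is twice as high for the same path delay.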
The software running on the host system of the MRP is responsible for programming the fetch unit, starting the processor core, and stopping it after program execution. Furthermore, it emulates an OCSN RAM interface to supply the processor core with an easy-to-debug memory. At program start, the internal RAM buffer is filled from a file given on the command line. The program uses socket programming to communicate through the OCSN with the fetch unit. It programs the fetch unit to use the host system at OCSN port 100 as its RAM, and starts the processor core. After that it waits for RAM requests from the fetch unit and serves the requested data.
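The RAM emulation described above can be sketched as follows. This is not the thesis software: the class `EmulatedRam`, the function `serve`, word addressing, big-endian storage and zero-fill for unused addresses are all assumptions made for illustration, and the OCSN socket loop is replaced by a plain list of requested addresses because the frame layout is not given in this section.

```python
import struct

WORD_BYTES = 4     # 32bit databus (from the text)
ADDRESS_BITS = 16  # 16bit address bus (from the text)


class EmulatedRam:
    """Word-addressed RAM image held in host memory (hypothetical helper)."""

    def __init__(self, image: bytes) -> None:
        # Pad the program image to a whole number of 32bit words.
        data = image + b"\x00" * ((-len(image)) % WORD_BYTES)
        count = len(data) // WORD_BYTES
        if count > (1 << ADDRESS_BITS):
            raise ValueError("image does not fit into the address space")
        # Assumption: words are stored big-endian in the program file.
        self.words = list(struct.unpack(f">{count}I", data))

    def read(self, address: int) -> int:
        """Return the dataword at a word address; unused words read as 0."""
        if not 0 <= address < (1 << ADDRESS_BITS):
            raise ValueError("address outside the 16bit address space")
        return self.words[address] if address < len(self.words) else 0


def serve(ram: EmulatedRam, requested_addresses: list[int]) -> list[int]:
    """Answer a sequence of read requests.

    Stands in for the loop that would wait on the OCSN socket for
    requests from the fetch unit and send back the datawords.
    """
    return [ram.read(address) for address in requested_addresses]
```

In a real host program the image would be read from the file named on the command line and `serve` would be replaced by a blocking receive and send on the OCSN socket.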
Multiple programs were executed on the distributed processor core without any problems, such as a simple multiplication and printing the Fibonacci progression of fib(33).
The processor was also tested against the OCSN2BRAM component, which improves execution speed because the RAM is not emulated in software. There are more performance improvements possible, such as implementing a small cache into the fetch unit or extending the number of registers by adding another register file component.
This example system shows that it is possible to run complex distributed components on the MRP. The divided processor core easily fits into five CEBs.
11 Conclusion
This thesis addresses the usage of partial runtime reconfiguration in a general-purpose environment, such as standard personal computers. Such hybrid-hardware systems are commonly used for high performance computing, single-purpose computers and multi-purpose computers, but not in general-purpose computers yet. Image processing applications, simulation of electromagnetic fields, solid state physics and computer games among others can benefit from this integration by bringing their own hardware accelerators. These accelerators can be simple filter algorithms implemented in hardware or many streaming processors tightly interconnected. The requirements for hybrid hardware systems in general-purpose computing are different from high performance computing. Application software changes very fast in general-purpose computing. The processing tasks are very variable in contrast to high performance computing. Therefore, many components in many different sizes have to be configured into the runtime reconfigurable hardware. This requirement leads to the granularity problem of runtime reconfigurable design flows. The effects of this problem can be reduced using the grouping and the granularity solution presented in Chapter 6. Platform independence is another requirement in general-purpose computing because many CPU and FPGA vendors exist. OS integration is also very important to get a wide acceptance of the reconfigurable hardware by developers and users.
In this thesis a multi FPGA framework, called MRP, is presented. It uses the granularity solution (Chapter 6) to build an easily extensible reconfigurable system for general-purpose computing. In contrast to many other reconfigurable systems it supports a packet switched network spanning multiple FPGAs. This network features fast interconnection links up to 4.8Gbit/s. It supports a bridge to 1Gbit/s Ethernet. Through the Ethernet it is connectable to offboard host systems, such as a workstation or server. An onboard host system using a PRHS SoC is also available. Operating system support for the OCSN is available, enabling users and developers to access any component connected to the OCSN using BSD socket programming. This easy access supports the platform independence because it standardises hardware access to a common API. No other RS has this kind of OS integration. The MRP is divided into support and reconfiguration platforms. The first provides access to FPGA board resources like RAM or storage devices, while the second provides the runtime reconfigurability. The reconfiguration platform is implemented using the PR design flow of Xilinx Virtex5 FPGAs. Therefore, it is partitioned into many same sized RMs, called CEBs. These CEBs are interconnected using a CSN and a common signal interface. Through this structure they reduce the effects of the granularity problem. Components to be used on the MRP have to be divided into smaller components fitting into a CEB. Through the CSN they are interconnected to form the complex component again.
Chapter 10 evaluates the MRP according to the area usage, a maximum clock speed measurement and an example CPU based application.
The example MRP system, presented in this thesis, requires 75% of a Xilinx xc5vlx330 Virtex5 FPGA. The OCSN uses most of this space (43.31%). But this investment in area provides a very flexible and fast interconnection network with unique features. The actual hardware providing the runtime reconfiguration uses 54.66% of the used area. This area can be divided into 32.8% for the CEBs and 21.86% for the CSN. This is a hardware overhead of 0.6, but there is still improvement potential by increasing the number of CEBs per switch and optimizing the switch implementation.
Table 10.3 presents a matrix of the propagation delays of all possible CEB connections. The minimum clock frequency for CEBs connected to one switch is 135MHz using sequential circuits only and 67MHz with at least one combinational circuit. The maximum clock rates are 162MHz and 81MHz. Common clock rates for normal FPGA designs on a Virtex5 range from 25MHz up to 200MHz for very optimised designs. Hence, the measured minimum and maximum clock rates lie within this range. A reduced clock rate is the price for the improved flexibility.
The last evaluation property is a complex example application. A 32bit microcontroller for teaching purposes has been ported to the MRP. It is divided into five CEBs: fetch unit, decode unit, control unit, register file and ALU. The fetch unit requests datawords from OCSN components providing RAM, such as the OCSN2BRAM device. It is even possible to emulate a RAM on the host system using a user space program. An application on the host system loads the microcontroller program into some RAM, instantiates all the microcontroller components within the MRP and starts it. Programs like a simple multiplication or calculating the Fibonacci progression run on this distributed microcontroller without any problems.
This evaluation shows that the MRP fulfils the requirements for a RS in a general-purpose environment. The implementation of the MRP can be seen as a success.
11.1 Outlook
The development of the MRP is finished, but many development steps to integrate runtime reconfiguration into general-purpose computing need to be done.
OS support for runtime reconfiguration needs to be improved. At the moment reconfiguration is not part of any modern OS. Most research concerning this topic evaluates reconfiguration speed and schedules reconfigurable hardware like processes, but this approach is not feasible at the moment because reconfiguration times are not fast enough (see Table 1.1). Therefore, a more general approach would be better suited, such as treating reconfigurable hardware like a memory resource rather than like a process. In this way reconfigurable hardware could be requested in a malloc style.
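To illustrate what a malloc-style request could look like, the following Python sketch treats CEBs as a pool that applications allocate from and release to. It is purely hypothetical and not part of the MRP: the class `CebPool`, its methods, and the pool size of four switches with four CEBs each (as Figure 10.4 suggests) are assumptions.

```python
class CebPool:
    """Hands out free CEBs on request, similar to a memory allocator."""

    def __init__(self, switches: int = 4, cebs_per_switch: int = 4) -> None:
        # Every CEB is identified by (switch index, CEB index on that switch).
        self.free = [(s, c) for s in range(switches)
                     for c in range(cebs_per_switch)]
        self.owned: dict[int, list[tuple[int, int]]] = {}
        self._next_handle = 0

    def alloc(self, count: int):
        """Reserve `count` CEBs; returns a handle, or None if too few are free
        (comparable to malloc returning NULL)."""
        if count <= 0 or count > len(self.free):
            return None
        handle = self._next_handle
        self._next_handle += 1
        # Take CEBs in switch order so that a request stays on as few
        # switches as possible.
        self.owned[handle] = [self.free.pop(0) for _ in range(count)]
        return handle

    def release(self, handle: int) -> None:
        """Return all CEBs of a handle to the pool (comparable to free)."""
        self.free.extend(self.owned.pop(handle))
        self.free.sort()


pool = CebPool()
cpu = pool.alloc(5)        # e.g. the five CEBs of the example processor
print(pool.owned[cpu])     # which CEBs were handed out
pool.release(cpu)
```

A first-fit policy in switch order is used only because it is the simplest choice; a real allocator would also have to consider the propagation delays between the chosen CEBs.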
The MRP provides many CEBs for configuration. These CEBs are very similar to the CLBs of the FPGA infrastructure. Another field of research could be to implement a synthesis, placing and routing environment based on the MRP. The first step would be to design a generic CEB component, which could be the target of the synthesis process. The source of this process could be a hardware description in a HDL, or even a C program. Such a process enables the developer to optimise the implementation from two different directions, from the hardware and the software side.
Another research topic could be to implement runtime reconfigurable processors onto the MRP. Some basic approaches to runtime reconfigurable processors have been made by Dales [16], Hauser et al. [17], Razdan [18], Hallmannseder [15] and Niyonkuru [44]. These approaches could be advanced and tested on the MRP because it provides the basic infrastructure for this research. The implemented microcontroller system is divided into several individually reconfigurable CEBs. This is a base requirement for all the reconfigurable processors.
Appendix
A OCSN Frame Types
Table A.1 shows all frame types assigned at the moment.
Type ID   Protocol   Description
0         MAC        used at the data-link layer for identifying remote interfaces and flow control
1         ICMP       used at the application layer for ping like operation
2         LED        application layer protocol for communication with LED component
3         DATA       application layer protocol for communication with RAM devices
4         CEB        application layer protocol for communication with CEBs
5         ICAP       application layer protocol for communication with ICAP devices
6         CSN SW     application layer protocol for communication with CSN switch
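In host software the assignments of Table A.1 could be captured as an enumeration. The sketch below only mirrors the type IDs and protocol names from the table; the names `OcsnFrameType` and `describe` are hypothetical and do not come from the MRP sources.

```python
from enum import IntEnum


class OcsnFrameType(IntEnum):
    """Frame type IDs as listed in Table A.1."""
    MAC = 0      # data-link layer: remote interface identification, flow control
    ICMP = 1     # application layer: ping like operation
    LED = 2      # application layer: LED component
    DATA = 3     # application layer: RAM devices
    CEB = 4      # application layer: CEBs
    ICAP = 5     # application layer: ICAP devices
    CSN_SW = 6   # application layer: CSN switch


def describe(type_id: int) -> str:
    """Map a raw type ID to its protocol name, or 'unassigned'."""
    try:
        return OcsnFrameType(type_id).name
    except ValueError:
        return "unassigned"


print(describe(3))
print(describe(7))
```

Using an `IntEnum` keeps the values usable as plain integers when a frame header is packed or unpacked.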
Bibliography
[1] Wikipedia, “14 nanometer - Wikipedia, the free encyclopedia,” May 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=14_nanometer&oldid=599971737
[2] Xilinx, Inc., Partial Reconfiguration User Guide, 2010, http://www.xilinx.com.
[3] Xilinx, Inc., Virtex-5 FPGA User Guide, 2012, http://www.xilinx.com.
[4] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, “Runtime adaptive multiprocessor system-on-chip: RAMPSoC,” in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, Apr. 2008, pp. 1–7.
[5] M. Eckert, “FPGA-based system virtual machines,” Ph.D. dissertation, Helmut-Schmidt-Universität/Universität der Bundeswehr Hamburg, 2014.
[6] Convey Computer Corporation, Convey Personality Development Kit Reference Manual, December 2010, http://www.conveycomputer.com.
[7] Xilinx Zynq Product brief, Xilinx Inc., Xilinx Inc., 2100 Logic Drive, San Jose, CA 95124, USA. [Online]. Available: http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/
[8] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.
[9] M. Bohr, R. Chau, T. Ghani, and K. Mistry, “The high-k solution,” Spectrum, IEEE, vol. 44, no. 10, pp. 29–35, Oct. 2007.
[10] Sun Microsystems, Inc., “OpenSPARC T2 processor design and verification users’s guide,” November 2008, https://www.opensparc.net/.
[11] NVIDIA Corporation, “NVIDIA’s next generation CUDA compute architecture: Fermi,” 2009, http://www.nvidia.com/.
[12] C. Kao, “Benefits of partial reconfiguration,” Xcell journal, vol. 55, pp. 65–67, 2005.
[13] J. Von Neumann, “First draft of a report on the EDVAC,” IEEE Annals of the History
[14] K. Williston, “Roving reporter: FPGA + Intel Atom™ = configurable processor,” Dec. 2010. [Online]. Available: http://embedded.communities.intel.com/community/en/hardware/blog/2010/12/10/roving-reporter-fpga-intel-atom-configurable-processor
[15] D. Hallmannseder and B. Klauer, “Compilerunterstützung für die Dynamische Rekonfiguration eines Mikroprozessors,” in PII Workshop. Hamburg: Technische Informatik, Helmut-Schmidt-Universität, 2009.
[16] M. Dales, “The Proteus processor - a conventional CPU with reconfigurable functionality,” in FPL ’99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1999, pp. 431–437.
[17] J. R. Hauser and J. Wawrzynek, “Garp: A MIPS processor with a reconfigurable coprocessor,” in Proceedings of the FCCM’97, 1997, pp. 12–21.
[18] R. Razdan, “PRISC: programmable reduced instruction set computers,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1994.
[19] D. Gohringer, M. Hubner, T. Perschke, and J. Becker, “New dimensions for multiprocessor architectures: On demand heterogeneity, infrastructure and performance through reconfigurability; the RAMPSoC approach,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sep. 2008, pp. 495–498.
[20] B. Venners, Inside the Java Virtual Machine. New York, NY, USA: McGraw-Hill, Inc., 1996.
[21] T. Schwederski and M. Jurczyk, Verbindungsnetze, ser. Leitfäden der Informatik. Teubner, 1996.
[22] T.-Y. Feng, “A survey of interconnection networks,” Computer, vol. 14, no. 12, pp. 12–27, 1981.
[23] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002, an excellent survey paper on reconfigurable computing.
[24] H.-D. Ebbinghaus, J. Flum, and W. Thomas, Einführung in die mathematische Logik (5. Aufl.). Spektrum Akademischer Verlag, 2007.
[25] K. Urbanski and R. Woitowitz, Digitaltechnik: ein Lehr- und Übungsbuch, ser. Engineering online library. Springer, 2004.
[26] A. Otero, E. de la Torre, and T. Riesgo, “Dreams: A tool for the design of dynamically reconfigurable embedded and modular systems,” in Reconfigurable Computing
[27] Altera Product Catalog, Altera Inc. [Online]. Available: http://www.altera.com/literature/sg/product-catalog.pdf
[28] D. Bryant, “Disrupting the data center to create the digital services economy,” June 2014. [Online]. Available: https://communities.intel.com/community/itpeernetwork/datastack/blog/2014/06/18/disrupting-the-data-center-to-create-the-digital-services-economy
[29] ITU-T, “X.200: Information technology - open systems interconnection - basic reference model: The basic model,” ISO/IEC, no. 7498-1, p. 59, 1994. [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=20269
[30] A. S. Tanenbaum, “Network protocols,” ACM Comput. Surv., vol. 13, no. 4, pp. 453–489, 1981.
[31] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, 2006. [Online]. Available: http://doi.acm.org/10.1145/1132952.1132953
[32] K. C. Sevcik and M. J. Johnson, “Cycle time properties of the FDDI token ring,” IEEE Transactions on Software Engineering, vol. 13, 1987.
[33] W. H. Bahaa-El-Din and M. T. Liu, “Register-insertion: a protocol for the next generation of ring local-area networks,” Computer networks and ISDN systems, vol. 24, no. 5, pp. 349–366, 1992.
[34] H. Hellwagner and A. Reinefeld, SCI: Scalable Coherent Interface. Springer, 1999.
[35] G. Barnes, R. Brown, M. Kato, D. J. Kuck, D. Slotnick, and R. Stokes, “The ILLIAC IV computer,” Computers, IEEE Transactions on, vol. C-17, no. 8, pp. 746–757, Aug. 1968.
[36] R. Knecht, “Implementation of divide-and-conquer algorithms on multiprocessors,” in Parallelism, Learning, Evolution, ser. Lecture Notes in Computer Science, J. Becker, I. Eisele, and F. Mündemann, Eds. Springer Berlin Heidelberg, 1991, vol. 565, pp. 121–136. [Online]. Available: http://dx.doi.org/10.1007/3-540-55027-5_7
[37] N. Grebenjuk, “Conecting of OCSN to PRHS framework,” Bachelor Thesis, Helmut Schmidt University, 2014.
[38] Wikipedia, “Linux - Wikipedia, the free encyclopedia,” February 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=Linux&oldid=597293747
[39] R. Biddappa, “Clock domain crossing,” The Cadence India Newsletter, pp. 2–8, May 2005. [Online]. Available: http://www.cadence.com/india/newsletters/icon_2005-05.pdf
[40] C. E. Cummings, “Simulation and synthesis techniques for asynchronous FIFO design,” in SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002) User Papers, 2002.
[41] A. Athavale and C. Christensen, High-speed serial I/O made simple.
[42] R. Love, Linux-Kernel-Handbuch: Leitfaden zu Design und Implementierung von Kernel 2.6, ser. Open source library. Addison-Wesley, 2005.
[43] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on commercial FPGA chips,” in Signal Propagation on Interconnects, 6th IEEE Workshop on. Proceedings, May 2002, pp. 157–159.
[44] A. Niyonkuru and H. C. Zeidler, “Designing a runtime reconfigurable processor for general purpose applications,” in IPDPS, 2004.