Multicore Reconfiguration Platform —
A Research and Evaluation FPGA
Framework for Runtime Reconfigurable
Systems
Dipl.-Inf. Dominik Meyer
18 March 2015
DISSERTATION
approved by the Faculty of Electrical Engineering
of the Helmut-Schmidt-Universität /
Universität der Bundeswehr Hamburg
for the award of the academic degree
of Doktor-Ingenieur
submitted by
Diplom-Informatiker Dominik Meyer
from Rendsburg
Reviewers: Prof. Dr. Bernd Klauer
Prof. Dr. Udo Zölzer
Chair of the examination committee: Prof. Dr. Gerd Scholl
Date of the oral examination: 16.03.2015
Printed with the kind support of the HSU-Universität der Bundeswehr
Curriculum Vitae
Personal information
Surname(s) / First name(s): Meyer, Dominik
Email(s): [email protected]
Nationality(-ies): German
Date of birth: June 17, 1976
Education
Dates: 1993 - 1997
Title of qualification awarded: Abitur
Name and type of organisation providing education and training: Helene Lange Gymnasium Rendsburg / Germany
Dates: 1998 - 2008
Title of qualification awarded: Diplom in Computer Science
Name and type of organisation providing education and training: Christian-Albrechts-Universität zu Kiel
Work experience
Dates: 2000 - 2003
Occupation or position held: technical advisor/manager
Main activities and responsibilities: Buildup and management of the server infrastructure of an internet service provider and web hoster.
Name and address of employer: PcW KG
Dates: 2003 - 2009
Occupation or position held: technical manager
Main activities and responsibilities: Buildup and management of the server infrastructure of a web hoster. Development of firewall solutions.
Name and address of employer: die Netzwerkstatt
Dates: 2009 - now
Occupation or position held: research assistant
Main activities and responsibilities: research in runtime reconfigurable systems
Name and address of employer: Computer Engineering / Helmut Schmidt University Hamburg
Publications
[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp, 2011.
[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress, 2013.
[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform an alternative to rampsoc. SIGARCH Comput. Archit. News, 39(4):102–103, December 2011.
Acknowledgments
This thesis is the result of my work at the Institute of Computer Engineering at the Helmut Schmidt University/ University of the Federal Armed Forces Hamburg.
I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity to work on this thesis. I also want to thank the remaining members of my dissertation
committee, Prof. Dr. Scholl and Prof. Dr. Zölzer.
The discussions of my research results with my current and former colleagues at the Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert, Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase.
Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and support during the last years.
Acronyms
AES Advanced Encryption Standard.
ALU Arithmetical Logical Unit.
AMBA Advanced Microcontroller Bus Architecture.
API Application Programming Interface.
BRAM Block RAM.
CAN Controller Area Network.
CDC Clock Domain Crossing.
CEB Configurable Entity Block.
CLB Configurable Logic Block.
CMT Clock Management Tiles.
CPLD Complex Programmable Logic Device.
CPU Central Processing Unit.
CSMA/CD Carrier Sense - Multiple Access / Collision Detection.
CSN Circuit Switched Network.
DDR Double Data Rate.
DIP Dual Inline Package.
DNF Disjunctive Normal Form.
DSP Digital Signal Processor.
FF FlipFlop.
FFT Fast Fourier Transformation.
FIFO First In First Out.
FPGA Field Programmable Gate Array.
FSM Finite State Machine.
GPIO General Purpose Input Output.
GPU Graphical Processing Unit.
HDL Hardware Description Language.
HSTL High-Speed Transceiver Logic.
HTTP Hypertext Transfer Protocol.
I2C Inter-Integrated Circuit.
IC Integrated Circuit.
ICAP Internal Configuration Access Port.
ILP Instruction Level Parallelism.
IOB Input/Output Block.
ISA Instruction Set Architecture.
ISO International Organization for Standardization.
ITU International Telecommunication Union.
LAN Local Area Network.
LED Light Emitting Diode.
LUT LookUpTable.
LVDS Low-Voltage Differential Signaling.
LVTTL Low-Voltage Transistor Transistor Logic.
MAC Media Access Control.
MPSoC Multi-Processor System-on-Chip.
MPU Multiplier Unit.
MRP Multicore Reconfiguration Platform.
NOC Network On Chip.
OCSN On Chip Switching Network.
OS Operating System.
OSI Open Systems Interconnection Model.
PAL Programmable Array Logic.
PCI Peripheral Component Interconnect.
PCIe Peripheral Component Interconnect Express.
PE Processing Element.
PLA Programmable Logic Array.
POP3 Post Office Protocol Version 3.
PR Partial Reconfiguration.
PRHS Partial Reconfiguration Heterogeneous System.
RAM Random Access Memory.
RampSoC Runtime adaptive multiprocessor system-on-chip.
RC Reconfigurable Computing.
RM Reconfigurable Module.
RO Ring Oscillator.
RS Reconfigurable System.
RTL Register Transfer Level.
SATA Serial Advanced Technology Attachment.
SCI Scalable Coherent Interface.
SoC System on Chip.
SPI Serial Peripheral Interface.
TCP Transmission Control Protocol.
UART Universal asynchronous receiver/transmitter.
UDP User Datagram Protocol.
USB Universal Serial Bus.
VA Virtual Architecture.
VHDL Very High Speed Integrated Circuits HDL.
VR Virtual Region.
WAN Wide Area Network.
XDL Xilinx Description Language.
List of Figures
1.1 History of the IC processing size[1]
1.2 Partitioning of an FPGA for the Xilinx PR design flow[2]
2.1 and/or Matrix
2.2 Halfadder implemented in an and/or Matrix
2.3 4 to 1 Multiplexer
2.4 Cascaded 4 to 1 Multiplexer
2.5 Simple structure of an FPGA without interconnects
2.6 Structure of two Virtex5 CLBs[3]
2.7 Simple PR example[2]
3.1 Example RAMPSoC Configuration[4]
3.2 PRHS System Overview[5]
3.3 Overview of the Convey HC1 architecture[6]
3.4 Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.5 Structure of the Xilinx Zynq architecture[7]
3.6 COPACOBANA and RIVYERA interconnection overview
4.1 Example mobile phone System on Chip (SoC)
4.2 Graphical representation of the ISO/OSI Model
4.3 Direct and indirect interconnection networks
5.1 Example Ring network with eight nodes
5.2 Example bus with 4 nodes
5.3 Example grid networks with 16 nodes
5.4 Example tree networks
5.5 Example 4×4 crossbar networks
6.1 Example granularity problem
6.2 Example grouping solution configuration
6.3 Example granularity solution configuration
6.4 Area requirements of the different usage patterns
7.1 Example MRP System Overview
7.2 OCSN frame description
7.3 OCSN network structure overview
7.5 Example support platform
7.6 Example reconfiguration platform
7.7 CEB Signal Interface
7.8 CSN group
7.9 Full MRP design flow
7.10 Reduced MRP design flow
8.1 Clock Domain Crossing (CDC) component interface
8.2 Dual Port Block RAM interface
8.3 SimpleFiFo interface
8.4 Reception of one OCSN Frame
8.5 OCSN physical transmission component
8.6 OCSN physical reception component
8.7 Flowchart of OCSN identification protocol
8.8 Flowchart of OCSN flow control protocol
8.9 OCSN IF signal interface
8.10 OCSN IF implementation schematic
8.11 Graph of the OCSN IF FSM
8.12 Signal interface of an OCSN Switch
8.13 Signal interface of the addr compare component
8.14 OCSN switch implementation schematic
8.15 OCSN application component basic schematic
8.16 OCSN Ethernet Bridge FSMs
8.17 OCSN Ethernet Discovery Protocol
8.18 Crossbar Interconnection Schema
8.19 CSN Crossbar Switch Signal Interface
8.20 CSN Crossbar Switch Implementation Schematic
8.21 CSN2OCSN Bridge Signal Interface
10.1 MRP Measurement Configuration for Setup 1
10.2 Floorplan of the reconfiguration platform
10.3 Floorplan with interconnects of the reconfiguration platform
List of Tables
1.1 Configuration speed and time for a Xilinx xc5vlx330 FPGA
1.2 Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB Data
2.1 Truth table of a Halfadder
2.2 Different Boolean functions implemented with a 4 to 1 multiplexer
2.3 Example LUT implementing ∧, ∨ and ⊕
5.1 Classification of a bidirectional ring
5.2 Classification of a bus
5.3 Classification of an open grid (mesh) with 4×4 nodes
5.4 Classification of a closed grid (illiac) with 4×4 nodes
5.5 Classification of a tree
5.6 Classification of a crossbar network with n nodes
7.1 Variable speed of the OCSN
8.1 Address to register mapping
10.1 Area usage of the MRP
10.2 Maximum clock rates within each switch
10.3 Propagation Delay Matrix for all CEBs in ns
Contents
List of Figures
List of Tables
1 Introduction
1.1 Reconfigurable Hardware
1.1.1 Runtime Reconfiguration
1.2 Hybrid Hardware Approaches
1.2.1 Datapath Accelerators
1.2.2 Bus Accelerators
1.2.3 Multicore Reconfiguration
1.3 Thesis Objectives
1.4 Thesis Structure
2 Reconfiguration Fundamentals
2.1 Matrix Approach
2.2 Multiplexer Approach
2.3 Look Up Table Approach
2.4 Field Programmable Gate Arrays
2.4.1 Input/Output Blocks
2.4.2 Configurable Logic Blocks
2.4.3 Block RAM
2.4.4 Special IO Components
2.4.5 Interconnection Network
2.5 Partial Reconfiguration
3 Example Reconfigurable Systems
3.1 Research Systems
3.1.1 RampSoC
3.1.2 PRHS
3.1.3 Dreams
3.2 Commercial Systems
3.2.1 Convey HC1
3.2.2 Intel Stellarton
3.2.3 Xilinx Zynq Architecture
4 Interconnection Networks
4.1 Open Systems Interconnection Model
4.1.1 Application Layer
4.1.2 Presentation Layer
4.1.3 Session Layer
4.1.4 Transport Layer
4.1.5 Network Layer
4.1.6 Data Link Layer
4.1.7 Physical Layer
4.2 Topology
4.2.1 Interconnection Type
4.2.2 Grade and Regularity
4.2.3 Diameter
4.2.4 Bisection Width
4.2.5 Symmetry
4.2.6 Scalability
4.3 Interface Structure
4.3.1 Direct Networks
4.3.2 Indirect Networks
4.4 Operating Mode
4.4.1 Synchronous Connection Establishment
4.4.2 Synchronous Data Transmission
4.4.3 Asynchronous Connection Establishment
4.4.4 Asynchronous Data Transmission
4.4.5 Mixed Mode
4.5 Communication Flexibility
4.5.1 Broadcast
4.5.2 Unicast
4.5.3 Multicast
4.5.4 Mixed
4.6 Control Strategy
4.6.1 Centralised Control
4.6.2 Decentralised Control
4.7 Transfer Mode and Data Transport
4.8 Conflict Resolution
5 Example Network On Chip Architectures
5.1 Ring
5.2 Bus
5.2.1 Bus-Arbitration
5.2.2 Data Transmission Protocol
5.2.3 Classification
5.4 Tree
5.5 Crossbar
6 Granularity Problem of Runtime Reconfigurable Design Flow
6.1 Solutions
6.1.1 Grouping Solution
6.1.2 Granularity Solution
6.2 Granularity Problem and Hybrid Hardware
7 Multicore Reconfiguration Platform Description
7.1 On Chip Switching Network
7.1.1 Physical Layer
7.1.2 Data-link Layer
7.1.3 Network Layer
7.1.4 Transport Layer
7.1.5 Session Layer
7.1.6 Presentation Layer
7.1.7 Application Layer
7.2 Support Platform
7.2.1 GPIO
7.2.2 BRAM
7.2.3 DDR3 RAM
7.2.4 UART Bridge
7.2.5 Ethernet Bridge
7.2.6 Soft-core SoC
7.3 Reconfiguration Platform
7.3.1 ICAP
7.3.2 CEB
7.3.3 CSN
7.3.4 IOB
7.4 Operating System Support
7.5 Design Flow
8 Implementation of the Multicore Reconfiguration Platform
8.1 General Components
8.1.1 Clock Domain Crossing
8.1.2 Dual Port Block RAM
8.1.3 FiFo Queue Component
8.2 OCSN
8.2.1 OCSN Physical Interface Components
8.2.2 OCSN Data-Link Interface Component
8.2.3 OCSN Network Component
8.3 CSN
8.3.1 Physical Layer Implementation
8.3.2 Network Layer Components
8.3.3 Application Layer Components
9 Operating System Support Implementation
9.1 OCSN Network Driver
9.2 OCSN Network Device Driver
10 Evaluation
10.1 Area Usage
10.2 Maximum CSN Propagation Delay Measurement
10.2.1 RO-Component
10.2.2 ReRouter-Component
10.2.3 Measuring Setup
10.2.4 Measurement Results
10.3 Example Microcontroller Implementation for MRP
11 Conclusion
11.1 Outlook
Appendix
A OCSN Frame Types
1 Introduction
Gordon E. Moore[8] stated in 1965, in the context of the growing Integrated Circuit (IC) market:
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.” The main conclusion of his paper is that the density of transistors on
an IC periodically doubles. This prediction still holds after 48 years, according to Intel
employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9].
ICs, such as general-purpose processors, are now produced in a 14 nm technology process. Figure 1.1 displays the history of processing sizes for ICs over the last decades. With every doubling of the transistor density, more logic components can be placed onto one IC. Processor designers are using this newly available space to add more and more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to continue for a while, equipping general-purpose processors with more parallel computing power. System on Chips (SoCs) are another product of the available space on ICs. They feature single and multicore processors combined with a GPU and additional accelerator hardware. This accelerator hardware improves the computing power with Digital Signal Processors (DSPs) or other mathematical functions implemented in hardware.
Beyond exploiting the available space with more and more static hardware, it can also be used for adding reconfigurable hardware.

[Figure 1.1: History of the IC processing size[1]. Axes: year (1970 to 2015) against size in nm (0 to 10000).]

File Size (MB)   Interface   Bit-width   Clk (MHz)   Speed (Mb/s)   Time (ms)
9.6              SelectMap   8           50          400            192
9.6              SelectMap   16          50          800            96
9.6              SelectMap   32          50          1600           48
Table 1.1: Configuration speed and time for a Xilinx xc5vlx330 FPGA
1.1 Reconfigurable Hardware
Reconfigurable hardware has the ability to change its function after chip assembly and allows the configuration of any digital circuit, such as Advanced Encryption Standard (AES) and Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions and even some specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like Arithmetical Logical Units (ALUs) and Multiplier Units (MPUs), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.
One important limitation of FPGAs was that they had to be reconfigured completely, even for small system changes. Every computation taking place in hardware had to be stopped and a programming file, representing the changed functionality, was loaded into the FPGA. Even if only half of the reconfigurable area was computing and the other half was without functionality, the whole area had to be replaced. This was and still is a very time-intensive task. The reconfiguration process takes many milliseconds to complete, depending on the size of the file and the configuration channel. This process erases the internal states of all configured hardware components. Table 1.1 presents the calculated minimal configuration times for a Xilinx FPGA and a 9.6 MB configuration file using the fastest available configuration interface.
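The values in Table 1.1 follow from dividing the file size by the raw interface throughput. The following minimal sketch reproduces that arithmetic. It assumes that 1 MB equals 10^6 bytes and that one bus word is transferred per clock cycle; the function name `config_time_ms` is chosen here for illustration only.

```python
def config_time_ms(file_size_mb: float, bus_width_bits: int, clock_mhz: float) -> float:
    """Lower bound on the configuration time in milliseconds.

    Assumptions (not taken from a vendor data sheet): 1 MB = 10**6 bytes,
    and one bus word is transferred on every clock cycle without overhead.
    """
    bits_to_load = file_size_mb * 8e6                       # file size in bits
    throughput_bits_per_s = bus_width_bits * clock_mhz * 1e6
    return bits_to_load / throughput_bits_per_s * 1e3


# Reproduce the three rows of the table for a 9.6 MB file at 50 MHz.
for width in (8, 16, 32):
    print(f"{width:2d} bit: {config_time_ms(9.6, width, 50):6.1f} ms")
```

Doubling the bus width halves the configuration time, while the time grows linearly with the file size. The same function can be applied to smaller configuration streams.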
1.1.1 Runtime Reconfiguration
Because of the configuration time limitation and to enable replacing one part of a design while other parts are still doing computations, hardware vendors introduced the concept of runtime reconfiguration. Runtime reconfiguration is also often referred to as dynamic reconfiguration or partial runtime reconfiguration. Such a runtime reconfigurable project has to be partitioned into multiple Reconfigurable Modules (RMs) during the design phase. Figure 1.2 shows an example partitioning of an FPGA for use with the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial reconfiguration for Xilinx FPGAs. Two differently sized RMs are available, each connected to some special “static” control hardware.

[Figure 1.2: Partitioning of an FPGA for the Xilinx PR design flow[2]. Labels: FPGA, “static” logic, RM0, RM1, RM00.bit to RM03.bit, RM10.bit to RM13.bit.]
This feature does not speed up the configuration process itself, but through the partitioning of the reconfigurable area the size of the individual configuration stream shrinks, which reduces the time for the reconfiguration process of one RM. For example, if the size of the configuration stream for one RM can be reduced to 0.25 MB, the configuration times of Table 1.2 are achieved. This is an enormous speed-up, but it can only be achieved if the design can be partitioned and the RMs can be reconfigured individually rather than all at once.
The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].
1.2 Hybrid Hardware Approaches
Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems.
The industry has already produced some hybrid systems, such as the Xilinx Zynq architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6]. The first combines an ARM Cortex-A9 processor core with a Xilinx FPGA on the same chip, but not on the same die. The next combines an Intel Atom processor with an Altera FPGA in the same manner. The last interconnects one Intel Xeon processor with four Xilinx FPGAs through the Intel co-processor interface. Still missing are hybrid hardware systems combined on a single die.

File Size (MB)   Interface   Bit-width   Clk (MHz)   Speed (Mb/s)   Time (ms)
0.25             SelectMap   8           50          400            5
0.25             SelectMap   16          50          800            2.5
0.25             SelectMap   32          50          1600           1.25
Table 1.2: Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB Data
Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have been evaluated.
1.2.1 Datapath Accelerators
Hallmannseder[15], Dales[16], Hauser et al.[17] and Razdan[18] added reconfiguration directly into processor cores by adding reconfigurable accelerator units to the datapath of the processor. These units are small and cannot be merged to form larger ones. They improve the processor performance by exploiting Instruction Level Parallelism (ILP) through additional computational datapath units, or by extending the Instruction Set Architecture (ISA) with special instructions. Examples of these special instructions are cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators improve the performance the most if they are tightly integrated into the processor core without long interconnects.
1.2.2 Bus Accelerators
Bus accelerators are small to medium-sized reconfigurable components and can be configured with specialised hardware to improve the runtime of a specific part of a program. They are connected to the processor through a bus or a network. These accelerators have to work independently on some part of the data because of the high bus/network latency. This can relieve the static core(s) of some portion of the data that can be processed in parallel. Because of the independent nature of these accelerators, they have an internal state and sometimes a connection to the main memory of the system. Bus accelerators are a very simple form of extending the performance of processor cores because existing buses, like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be used, but more tightly coupled interconnects are also possible.
1.2.3 Multicore Reconfiguration
The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer et al.[4, 19] evaluates the multicore reconfiguration approach. With multicore reconfiguration, multiple processor cores can be configured at system runtime. The system can adjust itself to the nature of the problem currently being solved. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core. These processor cores are called softcores because they are not statically implemented. The size of the largest one defines the size of the smallest RM, if every processor core [...] sized processor cores, but this reduces the number of usable processor cores of the same size.
1.3 Thesis Objectives
Most of the research on hybrid hardware systems focuses on one combining class only, always uses a fixed number of statically sized cores or units, and includes only high-performance computing applications. This is also true for industrial products.
These restrictions limit the number of application scenarios for each architecture. To deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example applications benefiting from hybrid hardware in general-purpose computing are image processing applications, the simulation of electromagnetic fields, solid state physics and computer games. Image processing applications could use hybrid hardware to accelerate certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. The simulation of electromagnetic fields and solid state physics can accelerate their computations by offloading certain calculations to the reconfigurable hardware. Both fields already use modern graphics cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphics cards to accelerate physical calculations for their simulated world. Hence, with reconfigurable hardware, each computer game could bring its own hardware for doing such calculations. All of this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each computer application can increase the processing power or reduce the power consumption of the whole system. Often, applications in a general-purpose environment run concurrently, inducing the requirement of a variable number and a variable size of reconfigurable modules. These all-purpose computing capabilities require more flexible design rules than systems supporting just one combination class.
Computer systems are divisible into single-purpose computers, multipurpose computers and general-purpose computers. Single-purpose computers are designed for a specific calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes. This is already very common. Multipurpose computers are specialised for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs, reconfigurable accelerator units are available. They enable developers to extend the functionality or integrate new algorithms. The last computation class, the general-purpose computers, lacks support for reconfigurable hardware at the moment. This thesis aims to change this situation.
As mentioned earlier, the FPGA has to be partitioned into multiple modules to support runtime reconfiguration. This partitioning is fixed after the initial system design stage. This early-stage floorplanning leads to the granularity problem of runtime reconfigurable design flow because different sized components shall be runtime reconfigurable with [...] sized component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem, and the solution proposed in this thesis, are described in more detail in Chapter 6.
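The inefficiency can be illustrated with a small calculation. The sketch below assumes that the largest component determines the size of every module and that all modules have the same size. The area figures are hypothetical numbers in arbitrary units, and the function name `fixed_floorplan` is chosen for illustration only.

```python
from typing import Dict, Sequence


def fixed_floorplan(fpga_area: int, component_areas: Sequence[int]) -> Dict[str, float]:
    """Model a floorplan in which every module is as large as the largest component."""
    module_area = max(component_areas)           # largest component sets the module size
    module_count = fpga_area // module_area      # equally sized modules that fit on the FPGA
    # Worst case: a module is occupied by the smallest component only.
    worst_case_utilisation = min(component_areas) / module_area
    return {
        "module_area": module_area,
        "module_count": module_count,
        "worst_case_utilisation": worst_case_utilisation,
    }


# Hypothetical example: one large softcore and two small accelerators.
result = fixed_floorplan(fpga_area=1000, component_areas=[400, 50, 30])
print(result)
```

In this assumed example only two modules fit onto the device, and a module holding the smallest component uses less than a tenth of its area. The more the component sizes differ, the more area is wasted.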
Deploying hybrid hardware into general-purpose computing leads to another problem. At the moment, it is relatively easy to write platform-independent programs by using a higher-level programming language like C. Languages like Java are ignored because the programs are running in a runtime virtual machine, not on the bare hardware[20]. Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures with the same base assumptions, only differing in the ISA. Writing platform-independent programs for hybrid hardware is much more complicated because these programs consist of software and hardware parts. The reconfigurable hardware in such a system is called configware. While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not agreed upon an architecture for the hardware part yet. It cannot be expected that all these companies will decide on the same reconfiguration approach for their hybrid hardware systems. This complicates the development of the configware because developers have to describe hardware for different reconfiguration approaches.
Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi-FPGA framework called Multicore Reconfiguration Platform (MRP). This framework uses a new floorplanning technique for partitioning the FPGAs, and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of different sized reconfigurable components, only limited by FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. This framework also simplifies the development of platform-independent software and configware because the framework can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) for every hybrid hardware developer.
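The idea of dividing larger components into RM-sized parts can be sketched as a simple allocation. This is only an illustration under stated assumptions: all RMs have the same area, the overhead of the CSN is ignored, and the names `modules_needed` and `allocate` as well as the area figures are hypothetical. It does not describe the allocation strategy actually used by the MRP.

```python
from typing import Dict, List


def modules_needed(component_area: int, module_area: int) -> int:
    """Number of RMs a component occupies when it is split into RM-sized parts."""
    return -(-component_area // module_area)     # integer ceiling division


def allocate(components: Dict[str, int], module_area: int, module_count: int) -> Dict[str, List[int]]:
    """Assign consecutive RM indices to each component; fail if the RMs run out."""
    allocation: Dict[str, List[int]] = {}
    next_free = 0
    for name, area in components.items():
        need = modules_needed(area, module_area)
        if next_free + need > module_count:
            raise ValueError(f"not enough RMs left for component {name!r}")
        allocation[name] = list(range(next_free, next_free + need))
        next_free += need
    return allocation


# Hypothetical example: 20 small RMs of 50 area units each.
plan = allocate({"cpu": 400, "aes": 50, "fft": 30}, module_area=50, module_count=20)
for name, rms in plan.items():
    print(name, "occupies", len(rms), "RM(s):", rms)
```

With small RMs, the large component occupies several of them while each small component occupies exactly one, so the remaining RMs stay available for further components.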
The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet switched network, called On Chip Switching Network (OCSN). It allows intra-FPGA communication for configuring the RMs and programming the CSN, and also inter-FPGA communication, to combine multiple FPGAs to form a larger hybrid hardware system. This feature is also a novelty, like the solution to the granularity problem and the platform independence of configware.
1.4 Thesis Structure
The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration are introduced in Chapter 2, and some example Reconfigurable Systems (RSs) related to the MRP are presented in Chapter 3. The MRP uses two different kinds of Network On Chips (NOCs), the CSN and the OCSN. Chapter 4 introduces the principles of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a network classification based on work done by Schwederski et al.[21] and Feng[22]. Some important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains the granularity problem of runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7. It introduces the CSN, the OCSN and the design of the RMs. Chapter 8 describes the implementation of the MRP in more detail. Because the MRP is designed as a hybrid system, it needs support from the Operating System (OS). The required device drivers are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple different sized computing elements, is presented in Chapter 10. It evaluates the MRP according to area usage, maximum clock speed and example implementations. The conclusion of the thesis results and an outlook on future work are given in Chapter 11.
2 Reconfiguration Fundamentals
Reconfigurable hardware describes some kind of electronic circuit whose Boolean function can be changed or reconfigured after production of the circuit. Such hardware supports the creation of variable and specialised components the moment they are required. Different approaches exist to build basic elements of reconfigurable hardware. These basic elements can be combined to form larger systems and are produced as ICs, such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs) and Programmable Array Logics (PALs). The most important difference between these systems is their basic reconfigurable component. FPGAs are built out of LookUpTables (LUTs), while PLAs, PALs and CPLDs use arrays of and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses multiplexers. All the reconfigurable ICs can be used to build RSs or hybrid hardware systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is called Reconfigurable Computing (RC). The following sections give a short introduction to reconfigurable hardware. Compton et al.[23] provide a more detailed overview of reconfigurable hardware and related software.
2.1 Matrix Approach
The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example
matrix.

[Figure 2.1: example and/or matrix with the inputs a, b, c, d, the constants 0 and 1, three and-gates, and the outputs y0 and y1]

On the left side, the and-matrix prepares the connection of the input signals, the
negated input signals, a zero and a one signal to some and-gates. None of the vertical
signals is connected to the horizontal ones at the moment. Each intersection of these
signals holds a programmable switch, such as an electronic fuse or a Static
Random Access Memory (SRAM) cell. An electronic fuse makes the matrix
one-time programmable, while SRAM or other memory types make it
reprogrammable. On the right side, the or-matrix prepares the connection of the
and-gates to some or-gates. Its intersections are used in the same way as
those of the and-matrix. To configure a Boolean function of type f : B^n → B
into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A
DNF is the normalisation of a logical function, displayed as a disjunction of conjunctive
clauses. Every logical function without quantifiers can be converted to DNF[24].
a b S Cout
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Table 2.1: Truth table of a half adder
Figure 2.2: Half adder implemented in an and/or matrix
Figure 2.2 displays an example implementation of a half adder with the truth table
given in Table 2.1. The formulas for S and Cout can be read out of the truth table:
S = (a ∧ ¬b) ∨ (¬a ∧ b),
Cout = a ∧ b.
Both are in DNF and can be directly implemented in an and/or matrix. The nodes
in Figure 2.2 represent connections at the intersection points of the signals. Three variants of the matrix approach exist:
• Both the and-matrix and the or-matrix are programmable.
• Only the and-matrix is programmable; the or-matrix has a fixed programming.
• Only the or-matrix is programmable; the and-matrix has a fixed programming.
Different ICs use different variants of the matrix approach.
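
The way a DNF maps onto the two matrices can be illustrated with a small software model. The following Python sketch is only an illustration under assumptions: the names AND_TERMS, OR_MATRIX and evaluate are hypothetical, and the programming shown is the standard half adder (S = a xor b, Cout = a and b). Each entry of AND_TERMS stands for the connections of one and-gate, and each entry of OR_MATRIX lists the and-gates connected to one or-gate.

```python
from itertools import product

# Programming of the and-matrix (hypothetical layout): every and-gate lists
# the literals it is connected to as (input name, required value) pairs.
AND_TERMS = [
    [("a", 1), ("b", 0)],  # a AND NOT b
    [("a", 0), ("b", 1)],  # NOT a AND b
    [("a", 1), ("b", 1)],  # a AND b
]

# Programming of the or-matrix: every output lists the and-gates it ORs.
OR_MATRIX = {
    "S": [0, 1],   # S    = term 0 OR term 1
    "Cout": [2],   # Cout = term 2
}


def evaluate(inputs):
    """Evaluate all outputs of the programmed and/or matrix."""
    terms = [all(inputs[name] == value for name, value in term)
             for term in AND_TERMS]
    return {out: int(any(terms[i] for i in gates))
            for out, gates in OR_MATRIX.items()}


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        result = evaluate({"a": a, "b": b})
        print(a, b, result["S"], result["Cout"])
```

Reprogramming the matrix corresponds to changing the two tables only; the evaluation logic, like the gates of the real matrix, stays fixed.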
2.2 Multiplexer Approach
A multiplexer is a small digital selector device. It routes one of n input signals to its
output. The number of input signals depends on the number of selection signals. If x
selection signals are available, the multiplexer can process 2^x input signals. Figure 2.3
shows a 4 to 1 multiplexer with data inputs e0 ... e3 and selection inputs s0 and s1.

Figure 2.3: 4 to 1 Multiplexer
Simple Boolean functions f : B × B → B can be built out of this multiplexer by using
s0 and s1 as the input variables and assigning each of the data inputs the results of the
function. Table 2.2 shows how to implement the logic functions ∧, ∨ and ⊕ with a
multiplexer.

e0 e1 e2 e3 function
0 0 0 1 f(s0, s1) = s0 ∧ s1
0 1 1 1 f(s0, s1) = s0 ∨ s1
0 1 1 0 f(s0, s1) = s0 ⊕ s1
Table 2.2: different Boolean functions implemented with a 4 to 1 multiplexer

To make this approach reconfigurable to different Boolean functions, FlipFlops [...]
functions can be configured. This pattern can be extended to implement functions of
type f : B^n → B by cascading multiplexers. An example is given in Figure 2.4.

Figure 2.4: Cascaded 4 to 1 Multiplexer

There are two additional input variables available: s2 and s3. However, this pattern does not scale
because for every two additional input variables the required number of multiplexers quadruples.
Another method to increase the number of input variables is to increase the number
of selection signals, but this does not scale either due to signal fanning: for x selection
signals, 2^x input signals are required.
Functions of type f : B^n → B^m have to be split into m functions of type f : B^n → B to
be implementable with the multiplexer pattern.
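
The multiplexer pattern can be made concrete with a short Python sketch. It is a sketch under assumptions, not part of the original design: the function names mux4, mux_function and mux16 are hypothetical, and s0 is treated as the more significant selection bit (the three functions of Table 2.2 are symmetric, so this choice does not affect them). CONFIGS holds the data input values e0 ... e3 from Table 2.2, and mux16 models the cascade of Figure 2.4 with sixteen data inputs.

```python
def mux4(e, s0, s1):
    """4 to 1 multiplexer: the selection pair (s0, s1) picks one data input."""
    return e[(s0 << 1) | s1]


# Data input values e0..e3 taken from Table 2.2.
CONFIGS = {
    "and": (0, 0, 0, 1),
    "or":  (0, 1, 1, 1),
    "xor": (0, 1, 1, 0),
}


def mux_function(name, s0, s1):
    """Evaluate the Boolean function configured under `name`."""
    return mux4(CONFIGS[name], s0, s1)


def mux16(e, s0, s1, s2, s3):
    """Cascade of five 4 to 1 multiplexers.

    Four first-level multiplexers share s0 and s1; a second-level
    multiplexer selects one of their outputs with s2 and s3.
    """
    first_level = [mux4(e[4 * i:4 * i + 4], s0, s1) for i in range(4)]
    return mux4(first_level, s2, s3)


if __name__ == "__main__":
    for s0 in (0, 1):
        for s1 in (0, 1):
            print(s0, s1, [mux_function(n, s0, s1) for n in ("and", "or", "xor")])
```

The cascade needs five multiplexers for four input variables, compared to one multiplexer for two input variables, which illustrates why the pattern scales poorly.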
2.3 Look Up Table Approach
A better solution to implement reconfigurable functions of type f : B^n → B is to use a
small RAM or LUT. The address signals of the RAM are used as the input parameters
and the data words hold the result of the function. Table 2.3 displays the implementation
of the simple Boolean functions ∧, ∨ and ⊕ in a LUT with an address width of three
and a data width of eight. Because only two operands are required for these operations,
a1 and a2 are selected as the input variables. The results are encoded in the data word,
starting from the leftmost bit for ∧.
The LUT approach supports the calculation of multiple functions of
type f : B^n → B concurrently by using different bits of the data word as the results.
This approach is better suited for the calculation of f : B^n → B^m functions than any
other presented approach because it only requires one LUT, as long as m is less than or equal
to the size of one data word. For functions with m greater than the size of one data word, LUTs [...]

a0 a1 a2 Dataword (8bit)
0 0 0 00000000
0 0 1 01100000
0 1 0 01100000
0 1 1 11000000
1 0 0 00000000
1 0 1 00000000
1 1 0 00000000
1 1 1 00000000
Table 2.3: Example LUT implementing ∧, ∨ and ⊕
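
The lookup in Table 2.3 can be reproduced with a few lines of Python. This is a minimal sketch: the names LUT, lookup, bit and and_or_xor are hypothetical, the table contents are copied from Table 2.3, and bit positions are counted from the left as described in the text.

```python
# Contents of the LUT from Table 2.3, indexed by the address (a0, a1, a2).
LUT = [
    0b00000000,  # 0 0 0
    0b01100000,  # 0 0 1
    0b01100000,  # 0 1 0
    0b11000000,  # 0 1 1
    0b00000000,  # 1 0 0
    0b00000000,  # 1 0 1
    0b00000000,  # 1 1 0
    0b00000000,  # 1 1 1
]


def lookup(a0, a1, a2):
    """Return the 8 bit data word stored at address (a0, a1, a2)."""
    return LUT[(a0 << 2) | (a1 << 1) | a2]


def bit(word, position):
    """Return one bit of the data word; position 0 is the leftmost bit."""
    return (word >> (7 - position)) & 1


def and_or_xor(a1, a2):
    """Read AND, OR and XOR of a1 and a2 from a single data word."""
    word = lookup(0, a1, a2)
    return bit(word, 0), bit(word, 1), bit(word, 2)


if __name__ == "__main__":
    for a1 in (0, 1):
        for a2 in (0, 1):
            print(a1, a2, and_or_xor(a1, a2))
```

One memory access delivers all three results at once, which is the concurrency property that distinguishes the LUT from the multiplexer pattern.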
2.4 Field Programmable Gate Arrays
To extend Boolean functions, as explained in the previous sections, to Finite State
Machines (FSMs) or even more complex circuits, memory and
interconnects are necessary.
Many ICs provide the required resources to configure digital circuits, such as FPGAs,
PLAs, CPLDs and PALs. This section describes the general structure of FPGAs
because they are used for the prototype system in this thesis. Many books provide this
information; this section is based on the book by Urbanski et al.[25]. In contrast to
the name, an FPGA is not an array of gates, but an array of configurable basic elements,
such as Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM
(BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the
basic FPGA structure with CLBs and IOBs, and without interconnects. They are
organised in an array structure to simplify the interconnection of the blocks. All components
of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs.

Figure 2.5: Simple structure of an FPGA without interconnects
2.4.1 Input/Output Blocks
IOBs are the interface from the configured hardware to the input and output pins of the
FPGA. They are also configurable by the developer to support different voltage levels
and input/output signal standards, such as Low-Voltage Transistor-Transistor Logic
(LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic
(HSTL).
2.4.2 Configurable Logic Blocks
CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays
the structure of two CLBs.

Figure 2.6: Structure of two Virtex5 CLBs[3]

The switch matrix is already part of the FPGA's
interconnection network. One CLB consists of two slices. These slices are tightly interconnected
through carry lines to increase the operand size of Boolean functions. Two CLBs
are always connected through a shift line to form large shift registers.
Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs,
four storage elements, wide-function multiplexers, and carry logic[3].
The LUTs have six independent inputs and two independent outputs. This
structure supports the configuration of one Boolean function of type f : B^6 → B or
two Boolean functions of type f : B^5 → B if the two functions share the same input
parameters. Three multiplexers are connected to the four LUTs in one slice to support
combining LUTs, increasing the number of possible inputs to seven or eight. Functions
with more inputs are implemented by combining slices.
D-type FFs provide storage functionality within each slice. Their input can be directly
driven from a LUT. Some special slices provide more storage capacity by merging LUTs [...]
2.4.3 Block RAM
FPGAs support BRAM to provide reconfigurable hardware with fast and area-inexpensive
RAM. On Xilinx Virtex5 FPGAs, BRAM is provided in 36 kbit blocks. They are placed in
columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5
devices the available BRAM ranges from 144 kbytes up to 2321 kbytes.
BRAM can be used as single port RAM, dual port RAM, or as First In First Out (FIFO)
queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues,
reducing the space requirements of the reconfigurable hardware. Access times for BRAM are
very short compared to off-chip Double Data Rate (DDR) RAM. A data word is available
one clock tick after issuing the address to the RAM, making it a good choice for fast
buffers or caches.
2.4.4 Special IO Components
Often, reconfigurable hardware requires special I/O components, such as Ethernet, Serial
Advanced Technology Attachment (SATA), or PCI. Implementing these I/O
components in reconfigurable hardware is possible, but requires much FPGA space.
Therefore, FPGAs provide some special non-reconfigurable I/O hardware. This hardware
implements common parts of I/O devices, which can be used to create the required
components. The Virtex5 FPGA family supports Ethernet MACs and RocketIO GTP
Transceivers.
Ethernet MACs reduce the area usage for Ethernet devices because they implement
the Media Access Control (MAC) layer of the Ethernet protocol.
RocketIO GTP Transceivers provide general components for high-speed serial I/O, such as 8b/10b encoders/decoders and fast serialisers and deserialisers. These transceivers can be
used to implement the physical layer of the PCI or SATA bus. The correct working mode
can be set through special instructions in the Hardware Description Language (HDL).
2.4.5 Interconnection Network
The interconnection network and the CLBs are the most important parts of the FPGA.
Without the interconnection network, the CLBs cannot be combined and larger
components cannot exchange data. FPGAs distinguish three different signal types, which
have to be routed through the interconnection network with different priorities and signal latencies.
clock signals: Clock signals require a fast distribution time throughout the FPGA
because they synchronise all components to their rising or falling edge.
reset signals: Reset signals are similar to clock signals. Through reset signals,
components are initialised at the same moment. This also requires a fast distribution
throughout the FPGA.
I/O signals: For I/O signals a fast distribution is also important, but the maximum [...]
Another important requirement for I/O signals is their number. A normal design only has around one to three different clock signals and as many reset signals, but the number of I/O signals is very large.
Therefore, FPGAs support two different interconnection networks: one for clock
and reset signals, and one for all the I/O signals required to exchange data between components.
2.5 Partial Reconfiguration
PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs[2]. It
extends the normal configuration capability of FPGAs with the ability to modify parts
of a running configuration without interrupting the computation.
The design is divided into a static and a reconfigurable part during development. For the static part, special entities called reconfiguration modules are defined, which hold the reconfigurable components. This definition includes a signal interface declaration for communicating with the static part. There can be different reconfiguration modules in one design, each with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component that should be configurable into one module.

Figure 2.7: simple PR example[2]

The synthesis process creates several FPGA configuration files. The main file includes
the static design and a component for each instance of a reconfiguration module. For every component and every instance, an additional partial configuration file is created.
These files can be loaded into the FPGA after the main file to reconfigure certain
reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable system. It features two reconfiguration module instances and four partial configuration
files per module. Instances can only be configured into the RMs for which they have [...]
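
The relation between reconfiguration module instances and their partial configuration files can be modelled in a few lines of Python. The sketch is purely illustrative and does not talk to any FPGA or Xilinx tool: the class ReconfigurableSystem and its methods are hypothetical, and the file names are the ones shown in Figure 2.7. It assumes that every partial configuration file belongs to exactly one module instance, which follows from the statement that a file is created for every component and every instance.

```python
# Partial configuration files per reconfiguration module instance, named as
# in Figure 2.7. The dictionary layout is an assumption for illustration.
PARTIAL_FILES = {
    "RM0": ["RM00.bit", "RM01.bit", "RM02.bit", "RM03.bit"],
    "RM1": ["RM10.bit", "RM11.bit", "RM12.bit", "RM13.bit"],
}


class ReconfigurableSystem:
    """Software model that tracks which partial file each RM currently holds."""

    def __init__(self, partial_files):
        self.partial_files = partial_files
        # After loading the main file, no partial file has been loaded yet.
        self.loaded = {rm: None for rm in partial_files}

    def reconfigure(self, rm, bitfile):
        """Load a partial file into an RM; reject files created for other RMs."""
        if bitfile not in self.partial_files[rm]:
            raise ValueError(f"{bitfile} was not created for {rm}")
        self.loaded[rm] = bitfile


if __name__ == "__main__":
    system = ReconfigurableSystem(PARTIAL_FILES)
    system.reconfigure("RM0", "RM02.bit")
    system.reconfigure("RM1", "RM10.bit")
    print(system.loaded)
```

With two instances and four components, eight partial files are needed, which hints at how the number of files grows with the number of module instances.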
3 Example Reconfigurable Systems
3.1 Research Systems
3.1.1 RampSoC
A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during
run-time by exploiting dynamically and partially reconfigurable hardware[4]. A special design flow is used, which combines the top-down and the bottom-up approach. The
bottom-up approach is used during design time to set up the basic conditions of a RampSoC
according to the problem space it should be used in. In the top-down approach, the software is optimised for this initial setup. Parts of this initial setup can be reconfigured to meet arising needs of applications during runtime, such as a different processor core
or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at
some point in time.

[Figure 3.1: possible RampSoC configuration with microprocessors of type 1 and type 2, their accelerators, and the switches of the communication network]

Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.
The implementation of a RampSoC is done using the early access PR concept of
Xilinx. This design flow is no longer supported by the Xilinx toolchain. The early
access PR design flow requires that reconfigurable modules are defined before
synthesis of the project. To reconfigure different cores, accelerators and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core gets its own reconfiguration module or whether the biggest core size is selected as the size for the reconfiguration unit. The developer
has to balance space exploitation against flexibility. The RampSoC approach uses
proprietary processor cores, such as Pico- and Microblaze cores from Xilinx. To these cores accelerator units are connected, which can change their hardware function while the processor is executing a program.
The RampSoC approach is a very flexible improvement compared to normal
multicore processors or MPSoCs. Its heterogeneous structure allows the optimal execution of
applications with different hardware requirements and can adapt to application needs
during runtime very easily. Processor cores can even be replaced by special FSMs
supporting calculations in special hardware components.
3.1.2 PRHS
The Partial Reconfiguration Heterogenous System (PRHS) developed by Eckert[5] also tries
to exploit the newly available space on ICs through reconfiguration. The PRHS is a softcore
SoC configured onto an FPGA. It features one RM of the Xilinx PR design flow.
Different hardware components can be configured into this RM. The RM can accelerate
computations on the SoC, but its main purpose is virtualisation.
Virtualisation in this case means the instantiation of a full SoC running under the
supervision of the static core. The virtualised SoC also runs Linux as its OS. Figure 3.2
displays this scenario. The static system on the right is running Linux as its OS. It
has full access to memory and memory-mapped I/O hardware components like Universal
Asynchronous Receiver/Transmitters (UARTs) or timers. On the left, an RM is available
and connected to the static system. The SoC configured at runtime into this RM has
only partial access to the memory. The accessible memory space is configured from the static system before the virtualised system is started. A memory-mapped I/O component
interconnects the RM and the static system. It supports starting and stopping the
virtualised system, but not suspending it. Providing a virtualised hard disk to the
reconfigurable system is another feature of the static system.
The PRHS is an interesting way of using tightly coupled reconfigurable hardware from
a static processor core. The virtualised processor cores can feature different ISAs and [...]

[Figure 3.2: structure of the PRHS, with the static system (processor, caches, bus controllers, timers, UARTs, boot RAM) on the right and the reconfigurable module with its reconfiguration interface on the left]
3.1.3 Dreams
Dreams is not itself an RS, but a tool to build runtime reconfigurable systems.
It processes Xilinx Description Language (XDL) files, created by the Xilinx tools, and
provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow
forces the developer to run the synthesis, place, and route process for every RM and
every implementation of a module, the Dreams design flow does not. It supports easy
relocation of RMs that were synthesised, placed and routed only once.
XDL is a human-readable language for describing netlists. It is compatible with the
ncd netlist file format, and Xilinx provides programs for easy conversion.
Dreams is developed by Otera et al.[26]. It tries to improve the Xilinx design flow in
four different ways:
1. Module relocation in any compatible region in the device
2. Independent design of modules and the static system
3. Hiding low level details from the designer
4. Enhanced module portability among different reconfigurable devices
Its design flow targets reconfigurable architectures built out of disjoint rectangular regions.
The system architecture enforced by the Dreams tool is divided into Virtual Regions
(VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as an RM
or static module. The VA describes the full system, including static and reconfigurable
parts and how they are interconnected using the FPGA's interconnect. The VRs and
the VA description are provided by the developer as Extensible Markup Language (XML)
files.
Dreams is a very interesting tool. Very large reconfigurable systems suffer from very long
placement and routing times in the Xilinx PR design flow. Dreams could significantly
reduce these times and thus the development time of such systems.
3.2 Commercial Systems
3.2.1 Convey HC1
One commercially available RS is the Convey HC1[6]. It combines four Xilinx Virtex5
FPGAs with an Intel Xeon processor through the x86 coprocessor interface. Figure 3.3
gives an overview of this architecture. The system contains two memories, one connected
to the processor cores and another one connected to the four FPGAs. Both are accessible
from the processor and the FPGA side. Hardware ensures cache coherency between
them. The memory on the FPGA side is specially partitioned to support concurrent
access to different memory banks from different FPGAs to increase the overall memory [...]

Figure 3.3: Overview of the Convey HC1 architecture[6]

Communication with the FPGAs is implemented using the coprocessor interface of
Intel processors. Software running on the Xeon processor can trigger hardware operations
running on one of the FPGAs by issuing special coprocessor instructions and writing
the data required for the operation to special memory regions. Programs can change
configurations in idle times of the FPGA. The Xilinx PR design flow is available in principle,
but is not yet supported by Convey, which results in long reconfiguration latencies and very
fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high
performance computing. In high performance computing the accelerator hardware seldom changes and one important factor is memory access. Memory access is very fast on the HC1 because of its special memory layout.
3.2.2 Intel Stellarton
Another commercial RS is the Intel Stellarton processor and FPGA SoC[14]. It combines
a standard Intel Atom Stellarton processor core with an Altera FPGA on the same chip,
but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC
contains all the standard components of the Intel Atom processor, like the DDR interface,
graphics adaptor/accelerator, audio component and Peripheral Component Interconnect
Express (PCIe) bus interface.
The Altera FPGA[27] is connected to the processor by this PCIe bus. Through this
bus the FPGA is configurable and application data can be exchanged between FPGA
and processor. The main purpose of this RS was to improve the performance of host
programs through accelerator hardware.
The production of the system has been discontinued, but a new approach by Intel seems to be on its way. According to Diane Bryant[28], Intel is working
on combining its Xeon server processors with FPGAs to improve the performance of [...]

Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.2.3 Xilinx Zynq Architecture
Zynq[7] is a very new hybrid hardware system produced by Xilinx. It features a dual-core
ARM Cortex A9 processor connected to many peripherals and an FPGA through an
Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall
system structure. Processor core and FPGA share the same chip, but not the same die,
like the Intel Stellarton processor. It provides many static hardware components to
connect to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller,
a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN)
controller. The AMBA bus, which connects the FPGA to the processor,
is very common in embedded devices. It supports general-purpose ports and
high performance ports from the processor to the FPGA. The FPGA has access to
high-speed serial I/O transceivers going off-chip and to the AMBA bus. All other features of
a Virtex7 FPGA are also supported, including PR.
The Zynq architecture is an interesting system for embedded hardware developers.
A standard embedded OS can run on the ARM processor cores, and the FPGA can
improve calculation performance for special applications, like audio and video editing, radio transmissions, and cryptographic algorithms.
3.3 COPACOBANA and RIVYERA
Figure 3.6: COPACOBANA and RIVYERA interconnection overview
The Copacobana and Rivyera systems developed by SciEngines are hybrid hardware systems optimised for cryptanalysis and scientific computing.
Both systems consist of many interconnected FPGAs working together to solve a
problem. The host system is connected through 10 Gbit Ethernet cards, 4 Gb Fibre
Channel cards, or InfiniBand. The Copacobana can try the complete 56-bit DES key space within 12.8 days. The Rivyera is the successor of the Copacobana.
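
The quoted search time can be turned into an average key rate with a short calculation. The Python sketch below only restates the two numbers given in the text (56 key bits, 12.8 days); the function name keys_per_second is hypothetical, and the result is the average rate of the whole machine, not of a single FPGA.

```python
KEY_BITS = 56   # size of the DES key space in bits
DAYS = 12.8     # time quoted for trying the complete key space


def keys_per_second(key_bits, days):
    """Average number of keys tested per second for a full key space sweep."""
    seconds = days * 24 * 60 * 60
    return 2 ** key_bits / seconds


if __name__ == "__main__":
    rate = keys_per_second(KEY_BITS, DAYS)
    print(f"{rate:.3e} keys per second")
```

The result is in the order of 10^10 keys per second, which explains why such a search is distributed over many FPGAs working in parallel.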
4 Interconnection Networks
Modern hardware design often requires the development of several interconnected components. Different interconnection network schemes are available today. If more tightly coupled systems are required, these components are combined on a single chip. Such a
tightly connected system is called an SoC.
Figure 4.1 displays an example mobile phone system with three different
interconnection schemes. This system can be developed as a multi-chip system or as an SoC.
The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a
radio transceiver.

Figure 4.1: Example mobile phone SystemOnChip (SoC) with a) bus connection, b) P2P connection, and c) NoC connection

These components interact in different ways to get the mobile phone running. The interactions can be implemented using different kinds of interconnection networks. Figure 4.1 shows three possible topologies. In a) all components are connected to a bus with the typical bus communication restrictions, such as exclusive bus access for a single component and poor scalability. In b) every component is directly connected to all components it interacts with. This network topology supports very flexible communication, but requires many interconnection links. The last displayed topology is a packet switched network built out of the components and switches. This
kind of network is called a NOC. NOCs are very similar to the communication
infrastructure of inter-computer networks, such as Local Area Networks (LANs) or Wide Area
Networks (WANs).
Many more network architectures exist. To distinguish these networks and to highlight their differences and performance properties, a classification is necessary.
In this work, part of the classification by Schwederski et al.[21] is used, which is [...]
The basis for a classification is usually a mathematical representation of the entity of interest. In this case, finite graphs are a good representation of interconnection networks.
The edges of the graph model the interconnection links and the nodes are the Processing
Elements (PEs) connected to the network. A PE is a component that performs calculations
and uses the network for communication purposes, such as a processor core, a DSP, or
some other kind of device controller.
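
The graph representation can be illustrated with a small Python sketch. It is a sketch under assumptions: the link list is a hypothetical wiring of the mobile phone components with two switches, loosely modelled on topology c) of Figure 4.1, and the switches are treated as additional graph nodes next to the PEs. The function names build_graph and hops are hypothetical. The hop count between two nodes is determined with a breadth-first search.

```python
from collections import deque

# Hypothetical packet switched network: five PEs attached to two switches.
LINKS = [
    ("CPU", "Switch0"), ("Memory", "Switch0"), ("DSP", "Switch0"),
    ("RF", "Switch1"), ("Keypad", "Switch1"), ("Switch0", "Switch1"),
]


def build_graph(links):
    """Build an undirected graph as an adjacency dictionary."""
    graph = {}
    for u, v in links:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def hops(graph, src, dst):
    """Number of links on a shortest path from src to dst, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


if __name__ == "__main__":
    graph = build_graph(LINKS)
    print(hops(graph, "CPU", "Memory"), hops(graph, "CPU", "Keypad"))
```

Properties such as the number of links or the distance between two PEs can be read directly from such a graph, which is what makes it a convenient basis for comparing topologies.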
This chapter is organised as follows: Section 4.1 describes the OSI model, an
industry-standard reference model for communication protocols that simplifies their development.
The distinguishing characteristics of NOCs are explained in
Section 4.2 to Section 4.8.
4.1 Open Systems Interconnection Model
Communication systems mostly consist of more than just two communication partners. These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach its destination, and the underlying infrastructure can differ from node to node because of different responsibilities. The transmitted data can be divided into a header, containing source and destination addresses, payload size, and quality of service information, and the actual payload. The position of the header data and the payload has to be defined to help every developer and manufacturer to produce compatible hardware. Later in this
work, protocols will be described using the terminology of the OSI model.
The International Telecommunication Union (ITU) and the International
Organization for Standardization (ISO)[29] developed the OSI model to simplify the definition
of communication protocols. Seven functionally distinct layers divide the communication process. Figure 4.2 gives a graphical representation of these layers and the expected protocol flow. The flow starts at either side of the network stack. If some data is to be transmitted to another communication partner, the communication usually starts at the application layer. Every layer processes the data and passes it down to the next layer until it reaches the physical layer. Each layer adds header information or transforms the data according to the network requirements. Sometimes control messages are created, passed down the layers and sent to the corresponding layer at the next communication partner to create a virtual connection between them.
The physical layer transmits the data through some kind of medium (wire, air, fibre optic, ...) to the next node. After the transmission, the data passes up the layers. If the node is just an intermediate one, the data moves up to the network layer, where it gets formatted for the transmission to the next node. If the data has arrived at its destination, it gets passed up to the application layer.
In the following sections each of the seven layers is briefly described. More information [...]

Figure 4.2: graphical representation of the ISO/OSI Model
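
The flow through the layers can be sketched in Python. The example is a strong simplification and makes several assumptions: every layer above the physical layer simply prepends a textual header, the physical layer is reduced to handing the bytes over unchanged, and the names HEADER_LAYERS, send and receive are hypothetical. Real protocols use binary headers and may also transform the payload.

```python
# Layers that add a header, ordered from top to bottom. The physical layer
# is left out because it only transmits the resulting bits.
HEADER_LAYERS = ["application", "presentation", "session",
                 "transport", "network", "data link"]


def send(payload):
    """Pass data down the stack; every layer prepends its own header."""
    frame = payload
    for layer in HEADER_LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame


def receive(frame):
    """Pass data up the stack; every layer removes its own header."""
    for layer in reversed(HEADER_LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


if __name__ == "__main__":
    frame = send(b"hello")
    print(frame)
    print(receive(frame))
```

The header of the lowest layer ends up at the front of the frame, so the receiving stack can remove the headers in the opposite order in which they were added.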
4.1.1 Application Layer
The application layer is the interface between a program or application running on a PE
and the communication infrastructure. It defines the interaction between two or more communication partners, such as how to request data or how to send data to the partner. For this interaction the application does not require any information about the underlying network; the destination address is enough. Very common application layer
protocols used in the Internet are the Hypertext Transfer Protocol (HTTP) and the Post Office
Protocol Version 3 (POP3).
4.1.2 Presentation Layer
Data can be represented in multiple forms. For example, processor cores use either big endian or little endian byte order when working with structures bigger than one byte.
A higher-level form is the character encoding, for example with ISO codes or UTF-8.
To allow the application layer to just use the passed data, the presentation layer converts and transforms the data to the required representation.
The presentation layer can also be used to implement point-to-point encryption.
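
The byte order conversion mentioned above can be shown with the Python standard library. This is a minimal sketch: the function names encode_u32 and convert are hypothetical, and a 32 bit unsigned integer is used as the example structure.

```python
import struct


def encode_u32(value, byte_order):
    """Encode a 32 bit unsigned integer as big or little endian bytes."""
    fmt = ">I" if byte_order == "big" else "<I"
    return struct.pack(fmt, value)


def convert(raw, source_order, target_order):
    """Re-encode raw integer bytes from one byte order to the other."""
    value = int.from_bytes(raw, source_order)
    return value.to_bytes(len(raw), target_order)


if __name__ == "__main__":
    little = encode_u32(0x12345678, "little")
    print(little.hex(), convert(little, "little", "big").hex())
```

A presentation layer that knows the byte order of both communication partners can apply such a conversion transparently, so the application layer always sees values in its native representation.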
4.1.3 Session Layer
A communication session consists of the connection establishment, the transmission and reception of multiple data items, and the teardown of the connection.