Multicore Reconfiguration Platform: A Research and Evaluation FPGA Framework for Runtime Reconfigurable Systems


Dipl.-Inf. Dominik Meyer

18 March 2015


DISSERTATION

approved by the Faculty of Electrical Engineering of the Helmut Schmidt University / University of the Federal Armed Forces Hamburg for the award of the academic degree of Doktor-Ingenieur

submitted by Diplom-Informatiker Dominik Meyer from Rendsburg


Reviewers: Prof. Dr. Bernd Klauer, Prof. Dr. Udo Zölzer

Chair of the examination committee: Prof. Dr. Gerd Scholl

Date of the oral examination: 16 March 2015

Printed with the kind support of the HSU, University of the Federal Armed Forces


Curriculum Vitae

Personal information

Surname(s) / First name(s): Meyer, Dominik
Email(s): [email protected]
Nationality(-ies): German
Date of birth: June 17, 1976

Education

Dates: 1993 - 1997
Title of qualification awarded: Abitur
Name and type of organisation providing education and training: Helene Lange Gymnasium Rendsburg, Germany

Dates: 1998 - 2008
Title of qualification awarded: Diplom in Computer Science
Name and type of organisation providing education and training: Christian-Albrechts-Universität zu Kiel

Work experience

Dates: 2000 - 2003
Occupation or position held: technical advisor/manager
Main activities and responsibilities: Buildup and management of the server infrastructure of an internet service provider and webhoster.
Name and address of employer: PcW KG

Dates: 2003 - 2009
Occupation or position held: technical manager
Main activities and responsibilities: Buildup and management of the server infrastructure of a webhoster. Development of firewall solutions.
Name and address of employer: die Netzwerkstatt

Dates: 2009 - now
Occupation or position held: research assistant
Main activities and responsibilities: research in runtime reconfigurable systems
Name and address of employer: Computer Engineering, Helmut Schmidt University Hamburg


Publications

[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp, 2011.

[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress, 2013.

[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform an alternative to RAMPSoC. SIGARCH Comput. Archit. News, 39(4):102–103, December 2011.


Acknowledgments

This thesis is the result of my work at the Institute of Computer Engineering at the Helmut Schmidt University/ University of the Federal Armed Forces Hamburg.

I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity to work on this thesis. I also want to thank the remaining members of my dissertation committee, Prof. Dr. Scholl and Prof. Dr. Zölzer.

The discussions of my research results with my current and former colleagues at the Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert, Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase.

Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and support over the last years.


Acronyms

AES Advanced Encryption Standard.

ALU Arithmetical Logical Unit.

AMBA Advanced Microcontroller Bus Architecture.

API Application Programming Interface.

BRAM Block RAM.

CAN Controller Area Network.

CDC Clock Domain Crossing.

CEB Configurable Entity Block.

CLB Configurable Logic Block.

CMT Clock Management Tiles.

CPLD Complex Programmable Logic Device.

CPU Central Processing Unit.

CSMA/CD Carrier Sense - Multiple Access / Collision Detection.

CSN Circuit Switched Network.

DDR Double Data Rate.

DIP Dual Inline Package.

DNF Disjunctive Normal Form.

DSP Digital Signal Processor.

FF FlipFlop.

FFT Fast Fourier Transformation.

FIFO First In First Out.

FPGA Field Programmable Gate Array.

FSM Finite State Machine.

GPIO General Purpose Input Output.

GPU Graphical Processing Unit.

HDL Hardware Description Language.

HSTL High-Speed Transceiver Logic.

HTTP Hypertext Transfer Protocol.

I2C Inter-Integrated Circuit.

IC Integrated Circuit.

ICAP Internal Configuration Access Port.

ILP Instruction Level Parallelism.

IOB Input/Output Block.


ISA Instruction Set Architecture.

ISO International Organization for Standardization.

ITU International Telecommunication Union.

LAN Local Area Network.

LED Light Emitting Diode.

LUT LookUpTable.

LVDS Low-Voltage Differential Signaling.

LVTTL Low-Voltage Transistor-Transistor Logic.

MAC Media Access Control.

MPSoC Multi-Processor System-on-Chip.

MPU Multiplier Unit.

MRP Multicore Reconfiguration Platform.

NOC Network On Chip.

OCSN On Chip Switching Network.

OS Operating System.

OSI Open Systems Interconnection Model.

PAL Programmable Array Logic.

PCI Peripheral Component Interconnect.

PCIe Peripheral Component Interconnect Express.

PE Processing Element.

PLA Programmable Logic Array.

POP3 Post Office Protocol Version 3.

PR Partial Reconfiguration.

PRHS Partial Reconfiguration Heterogeneous System.

RAM Random Access Memory.

RampSoC Runtime adaptive multiprocessor system-on-chip.

RC Reconfigurable Computing.

RM Reconfigurable Module.

RO Ring Oscillator.

RS Reconfigurable System.

RTL Register Transfer Level.

SATA Serial Advanced Technology Attachment.

SCI Scalable Coherent Interface.

SoC System on Chip.

SPI Serial Peripheral Interface.


TCP Transmission Control Protocol.

UART Universal asynchronous receiver/transmitter.

UDP User Datagram Protocol.

USB Universal Serial Bus.

VA Virtual Architecture.

VHDL Very High Speed Integrated Circuits HDL.

VR Virtual Region.

WAN Wide Area Network.

XDL Xilinx Description Language.


List of Figures

1.1 History of the IC processing size[1]
1.2 Partitioning of an FPGA for the Xilinx PR design flow[2]
2.1 and/or Matrix
2.2 Halfadder implemented in an and/or Matrix
2.3 4 to 1 Multiplexer
2.4 Cascaded 4 to 1 Multiplexer
2.5 Simple structure of an FPGA without interconnects
2.6 Structure of two Virtex5 CLBs[3]
2.7 Simple PR example[2]
3.1 Example RAMPSoC Configuration[4]
3.2 PRHS System Overview[5]
3.3 Overview of the Convey HC1 architecture[6]
3.4 Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.5 Structure of the Xilinx Zynq architecture[7]
3.6 COPACOBANA and RIVYERA interconnection overview
4.1 Example mobile phone System on Chip (SoC)
4.2 Graphical representation of the ISO/OSI Model
4.3 Direct and indirect interconnection networks
5.1 Example Ring network with eight nodes
5.2 Example bus with 4 nodes
5.3 Example grid networks with 16 nodes
5.4 Example tree networks
5.5 Example 4×4 crossbar networks
6.1 Example granularity problem
6.2 Example grouping solution configuration
6.3 Example granularity solution configuration
6.4 Area requirements of the different usage patterns
7.1 Example MRP System Overview
7.2 OCSN frame description
7.3 OCSN network structure overview
7.5 Example support platform
7.6 Example reconfiguration platform
7.7 CEB Signal Interface
7.8 CSN group
7.9 Full MRP design flow
7.10 Reduced MRP design flow
8.1 Clock Domain Crossing (CDC) component interface
8.2 Dual Port Block RAM interface
8.3 SimpleFiFo interface
8.4 Reception of one OCSN Frame
8.5 OCSN physical transmission component
8.6 OCSN physical reception component
8.7 Flowchart of OCSN identification protocol
8.8 Flowchart of OCSN flow control protocol
8.9 OCSN IF signal interface
8.10 OCSN IF implementation schematic
8.11 Graph of the OCSN IF FSM
8.12 Signal interface of an OCSN Switch
8.13 Signal interface of the addr compare component
8.14 OCSN switch implementation schematic
8.15 OCSN application component basic schematic
8.16 OCSN Ethernet Bridge FSMs
8.17 OCSN Ethernet Discovery Protocol
8.18 Crossbar Interconnection Schema
8.19 CSN Crossbar Switch Signal Interface
8.20 CSN Crossbar Switch Implementation Schematic
8.21 CSN2OCSN Bridge Signal Interface
10.1 MRP Measurement Configuration for Setup 1
10.2 Floorplan of the reconfiguration platform
10.3 Floorplan with interconnects of the reconfiguration platform


List of Tables

1.1 Configuration speed and time for a Xilinx xc5vlx330 FPGA
1.2 Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB of data
2.1 Truth table of a Halfadder
2.2 Different Boolean functions implemented with a 4 to 1 multiplexer
2.3 Example LUT implementing ∧, ∨ and ⊕
5.1 Classification of a bidirectional ring
5.2 Classification of a bus
5.3 Classification of an open grid (mesh) with 4×4 nodes
5.4 Classification of a closed grid (illiac) with 4×4 nodes
5.5 Classification of a tree
5.6 Classification of a crossbar network with n nodes
7.1 Variable speed of the OCSN
8.1 Address to register mapping
10.1 Area usage of the MRP
10.2 Maximum clock rates within each switch
10.3 Propagation Delay Matrix for all CEBs in ns


Contents

7.1.1 Physical Layer
7.1.2 Data-link Layer
7.1.3 Network Layer
7.1.4 Transport Layer
7.1.5 Session Layer
7.1.6 Presentation Layer
7.1.7 Application Layer
7.2 Support Platform
7.2.1 GPIO
7.2.2 BRAM
7.2.3 DDR3 RAM
7.2.4 UART Bridge
7.2.5 Ethernet Bridge
7.2.6 Soft-core SoC
7.3 Reconfiguration Platform
7.3.1 ICAP
7.3.2 CEB
7.3.3 CSN
7.3.4 IOB
7.4 Operating System Support
7.5 Design Flow
8 Implementation of the Multicore Reconfiguration Platform
8.1 General Components
8.1.1 Clock Domain Crossing
8.1.2 Dual Port Block RAM
8.1.3 FiFo Queue Component
8.2 OCSN
8.2.1 OCSN Physical Interface Components
8.2.2 OCSN Data-Link Interface Component
8.2.3 OCSN Network Component
8.3 CSN
8.3.1 Physical Layer Implementation
8.3.2 Network Layer Components
8.3.3 Application Layer Components
9 Operating System Support Implementation
9.1 OCSN Network Driver
9.2 OCSN Network Device Driver
10 Evaluation
10.1 Area Usage
10.2 Maximum CSN Propagation Delay Measurement
10.2.1 RO-Component
10.2.2 ReRouter-Component
10.2.3 Measuring Setup
10.2.4 Measurement Results
10.3 Example Microcontroller Implementation for MRP
11 Conclusion
11.1 Outlook
Appendix
A OCSN Frame Types


1 Introduction

Gordon E. Moore[8] stated in 1965, in the context of the growing Integrated Circuit (IC) market: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.” The main conclusion of his paper is that the density of transistors on an IC doubles periodically. According to Intel employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9], this prediction still holds after 48 years.

ICs, such as general-purpose processors, are now produced in a 14 nm technology process. Figure 1.1 displays the history of processing sizes for ICs over the last decades. With every doubling of the transistor density, more logic components can be placed onto one IC. Processor designers use this newly available space to add more and more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to continue for a while, equipping general-purpose processors with more parallel computing power. Systems on Chip (SoCs) are another product of the available space on ICs. They feature single and multicore processors combined with a GPU and additional accelerator hardware. This accelerator hardware improves the computing power with Digital Signal Processors (DSPs) or other mathematical functions implemented in hardware.

Figure 1.1: History of the IC processing size[1] (x-axis: year, 1970 to 2015; y-axis: size in nm, 0 to 10000)

File Size (MB)   Interface   Bit-width   Clk (MHz)   Speed (Mb/s)   Time (ms)
9.6              SelectMap   8           50          400            192
9.6              SelectMap   16          50          800            96
9.6              SelectMap   32          50          1600           48

Table 1.1: Configuration speed and time for a Xilinx xc5vlx330 FPGA

Beyond exploiting the available space with more and more static hardware, it can also be used for adding reconfigurable hardware.

1.1 Reconfigurable Hardware

Reconfigurable hardware can change its function after chip assembly and allows the configuration of any digital circuit, such as Advanced Encryption Standard (AES) accelerators, Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions and even some specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like Arithmetical Logical Units (ALUs) and Multiplier Units (MPUs), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.

One important limitation of FPGAs was that they had to be reconfigured completely, even for small system changes. Every computation taking place in hardware had to be stopped, and a programming file representing the changed functionality was loaded into the FPGA. Even if only half of the reconfigurable area was computing and the other half was without functionality, the whole area had to be replaced. This was and still is a very time-intensive task. Depending on the size of the file and the configuration channel, the reconfiguration process takes many milliseconds to complete. It also erases the internal states of all configured hardware components. Table 1.1 presents the calculated minimal configuration times for a Xilinx FPGA and a 9.6 MB configuration file using the fastest available configuration interface.
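The values in Table 1.1 follow from simple arithmetic: the interface moves one bus word per clock cycle, so the throughput is the bit-width times the clock rate, and the time is the file size divided by that throughput. The following Python sketch reproduces this calculation; the function name is hypothetical, and decimal megabytes (1 MB = 8 Mbit) are assumed because that matches the table.

```python
def config_time_ms(file_size_mb: float, bus_width_bits: int, clock_mhz: float) -> float:
    """Return the minimal configuration time in milliseconds.

    Assumes one bus word is transferred per clock cycle and
    decimal megabytes (1 MB = 8 Mbit), as Table 1.1 appears to do.
    """
    speed_mbit_per_s = bus_width_bits * clock_mhz   # throughput in Mbit/s
    size_mbit = file_size_mb * 8                    # file size in Mbit
    return size_mbit / speed_mbit_per_s * 1000      # seconds to milliseconds


if __name__ == "__main__":
    # Recompute the three rows of Table 1.1 (9.6 MB file, 50 MHz clock).
    for width in (8, 16, 32):
        print(f"{width:2d} bit: {config_time_ms(9.6, width, 50):.2f} ms")
```

Calling the same function with a smaller file size shows how strongly the configuration time depends on the size of the configuration stream.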

1.1.1 Runtime Reconfiguration

Because of the configuration time limitation, and to enable replacing one part of a design while other parts are still doing computations, hardware vendors introduced the concept of runtime reconfiguration. Runtime reconfiguration is also often referred to as dynamic reconfiguration or partial runtime reconfiguration. Such a runtime reconfigurable project [...] the design phase. Figure 2.7 shows an example partitioning of an FPGA for use with the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial reconfiguration for Xilinx FPGAs. Two differently sized Reconfigurable Modules (RMs) are available, each connected to some special “static” control hardware.

Figure 1.2: Partitioning of an FPGA for the Xilinx PR design flow[2] (labels: FPGA, “static” logic, RM0, RM1, bit files RM00.bit to RM03.bit and RM10.bit to RM13.bit)

This feature does not speed up the configuration process itself, but partitioning the reconfigurable area shrinks the size of the individual configuration stream, which reduces the time for reconfiguring one RM. For example, if the size of the configuration stream for one RM can be reduced to 0.25 MB, the configuration times of Table 1.2 are achieved. This is an enormous speed-up, but it can only be achieved if the design can be partitioned and the RMs can be reconfigured individually rather than all at once.

The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].

File Size (MB)   Interface   Bit-width   Clk (MHz)   Speed (Mb/s)   Time (ms)
0.25             SelectMap   8           50          400            5
0.25             SelectMap   16          50          800            2.5
0.25             SelectMap   32          50          1600           1.25

Table 1.2: Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0.25 MB of data

1.2 Hybrid Hardware Approaches

Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems.

The industry has already produced some hybrid systems, such as the Xilinx Zynq architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6]. The first combines an ARM Cortex A9 processor core with a Xilinx FPGA on the same chip, but not on the same die. The second combines an Intel Atom processor with an Altera FPGA in the same manner. The last interconnects one Intel Xeon processor with four Xilinx FPGAs through the Intel co-processor interface. Hybrid hardware systems combined on a single die are still missing.

Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have been evaluated.

1.2.1 Datapath Accelerators

Hallmannseder[15], Dales[16], Hauser et al.[17] and Razdan[18] added reconfiguration directly into processor cores by adding reconfigurable accelerator units to the datapath of the processor. These units are small and cannot be merged to form larger ones. They improve the processor performance by exploiting Instruction Level Parallelism (ILP) through additional computational datapath units, or by extending the Instruction Set Architecture (ISA) with special instructions. Examples of these special instructions are cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators improve the performance the most if they are tightly integrated into the processor core without long interconnects.

1.2.2 Bus Accelerators

Bus accelerators are small to medium-sized reconfigurable components and can be configured with specialised hardware to improve the runtime of a specific part of a program. They are connected to the processor through a bus or a network. Because of the high bus/network latency, these accelerators have to work independently on some part of the data. This can relieve the static core(s) of some portion of the data that can be processed in parallel. Because of their independent nature, these accelerators have an internal state and sometimes a connection to the main memory of the system. Bus accelerators are a very simple way of extending the performance of processor cores because existing buses, like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be used, but more tightly coupled interconnects are also possible.

1.2.3 Multicore Reconfiguration

The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer et al.[4, 19] evaluates the multicore reconfiguration approach. With multicore reconfiguration, multiple processor cores can be configured at system runtime. The system can adjust itself to the nature of the problem currently being solved. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core. These processor cores are called softcores because they are not statically implemented. The size of the largest one defines the size of the smallest RM, if every processor core [...] sized processor cores, but this reduces the number of usable processor cores of the same size.

1.3 Thesis Objectives

Most of the research on hybrid hardware systems focuses on one combining class only, always uses a fixed number of statically sized cores or units, and covers only high-performance computing applications. This is also true for industrial products.

These restrictions limit the number of application scenarios for each architecture. To deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example applications benefiting from hybrid hardware in general-purpose computing are image processing, the simulation of electromagnetic fields, solid state physics and computer games. Image processing applications could use hybrid hardware to accelerate certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. Simulations of electromagnetic fields and solid state physics can accelerate their computations by offloading certain calculations to the reconfigurable hardware. Both fields already use modern graphics cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphics cards to accelerate physical calculations for their simulated world. Hence, with reconfigurable hardware, each computer game could bring its own hardware for such calculations. All of this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each computer application can increase the processing power or reduce the power consumption of the whole system. Applications in a general-purpose environment often run concurrently, which induces the requirement for a variable number and a variable size of reconfigurable modules. These all-purpose computing capabilities require more flexible design rules than systems supporting just one combination class.

Computer systems can be divided into single-purpose computers, multipurpose computers and general-purpose computers. Single-purpose computers are designed for a specific calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes. This is already very common. Multipurpose computers are specialised for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs, reconfigurable accelerator units are available. They enable developers to extend the functionality or integrate new algorithms. The last computation class, the general-purpose computers, lacks support for reconfigurable hardware at the moment. This thesis aims to change this situation.

As mentioned earlier, the FPGA has to be partitioned into multiple modules to support runtime reconfiguration. This partitioning is fixed after the initial system design stage. This early-stage floorplanning leads to the granularity problem of the runtime reconfigurable design flow because differently sized components shall be runtime reconfigurable with [...] sized component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem and the solution proposed in this thesis are described in more detail in Chapter 6.

Deploying hybrid hardware into general-purpose computing leads to another problem. At the moment, it is relatively easy to write platform-independent programs by using a higher-level programming language like C. Languages like Java are ignored here because their programs run in a runtime virtual machine, not on the bare hardware[20]. Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures with the same base assumptions, differing only in the ISA. Writing platform-independent programs for hybrid hardware is much more complicated because these programs consist of software and hardware parts. The reconfigurable hardware in such a system is called configware. While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not yet agreed upon an architecture for the hardware part. It cannot be expected that all these companies will decide on the same reconfiguration approach for their hybrid hardware systems. This complicates the development of configware because developers have to describe hardware for different reconfiguration approaches.

Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi-FPGA framework called Multicore Reconfiguration Platform (MRP). This framework uses a new floorplanning technique for partitioning the FPGAs and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of differently sized reconfigurable components, limited only by the FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. The framework also simplifies the development of platform-independent software and configware because it can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) for every hybrid hardware developer.
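The effect of the granularity problem, and of splitting large components across small RMs, can be illustrated with a small calculation. The following Python sketch uses hypothetical area figures in arbitrary units and hypothetical function names; it ignores the hardware overhead of the CSN and of the floorplanning, so it only shows the counting argument, not the real MRP area usage.

```python
import math


def uniform_module_count(fpga_area, component_areas):
    """Conventional partitioning: every RM must hold the largest component."""
    return fpga_area // max(component_areas)


def small_modules_needed(component_areas, module_area):
    """Splitting approach: each component occupies as many small RMs as it needs."""
    return sum(math.ceil(area / module_area) for area in component_areas)


# Hypothetical numbers (arbitrary area units), chosen only for illustration.
FPGA_AREA = 1000
COMPONENTS = [400, 50, 50, 50]   # one large and three small components
SMALL_RM_AREA = 50

if __name__ == "__main__":
    # Only two RMs fit if each must hold the largest component,
    # so only two of the four components can be loaded at once.
    print("RMs sized for the largest component:", uniform_module_count(FPGA_AREA, COMPONENTS))
    # With small RMs, the large component is split across several of them.
    print("small RMs available:", FPGA_AREA // SMALL_RM_AREA)
    print("small RMs needed for all components:", small_modules_needed(COMPONENTS, SMALL_RM_AREA))
```

In this made-up configuration, all four components fit at the same time once the large one is spread over several small RMs, whereas the conventional partitioning leaves most of each RM unused when it holds a small component.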

The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet switched network, called On Chip Switching Network (OCSN). It allows intra-FPGA communication for configuring the RMs and programming the CSN, as well as inter-FPGA communication to combine multiple FPGAs into a larger hybrid hardware system. This feature is also a novelty, like the solution to the granularity problem and the platform independence of configware.


1.4 Thesis Structure

The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration are introduced in Chapter 2, and some example Reconfigurable Systems (RSs) related to the MRP are presented in Chapter 3. The MRP uses two different kinds of Networks On Chip (NOCs), the CSN and the OCSN. Chapter 4 introduces the principles of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a network classification based on work done by Schwederski et al.[21] and Feng[22]. Some important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains the granularity problem of the runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7. It introduces the CSN, the OCSN and the design of the RMs. Chapter 8 describes the implementation of the MRP in more detail. Because the MRP is designed as a hybrid system, it needs support from the Operating System (OS). The required device drivers are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple differently sized computing elements, is presented in Chapter 10. It evaluates the MRP according to area usage, maximum clock speed and example implementations. The conclusion of the thesis results and an outlook on future work are given in Chapter 11.


2 Reconfiguration Fundamentals

Reconfigurable hardware describes some kind of electronic circuit whose Boolean function can be changed or reconfigured after production of the circuit. Such hardware supports the creation of variable and specialised components the moment they are required. Different approaches exist to build basic elements of reconfigurable hardware. These basic elements can be combined to form larger systems and are produced as ICs, such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs) and Programmable Array Logics (PALs). The most important difference between these systems is their basic reconfigurable component. FPGAs are built out of LookUpTables (LUTs), while PLAs, PALs and CPLDs use arrays of and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses multiplexers. All the reconfigurable ICs can be used to build RSs or hybrid hardware systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is called Reconfigurable Computing (RC). The following sections give a short introduction to reconfigurable hardware. Compton et al.[23] provide a more detailed overview of reconfigurable hardware and related software.

2.1 Matrix Approach

The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example matrix. On the left side, the and-matrix prepares the connection of the input signals, the negated input signals, a zero and a one signal to some and-gates. None of the vertical signals are connected to the horizontal ones at the moment. The intersections of these signals are connected to a programmable switch, such as an electronic fuse or a Static Random Access Memory (SRAM) cell. An electronic fuse makes the matrix one-time programmable, while SRAM or other memory types make it programmable multiple times. On the right side, the or-matrix prepares the connection of the and-gates to some or-gates. The intersections of the signals are used in the same way as the intersections of the and-matrix. To configure a Boolean function of type $f : \mathbb{B}^n \rightarrow \mathbb{B}$ into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A DNF is the normalisation of a logical function, displayed as a disjunction of conjunctive clauses. Every logical function without quantifiers can be converted to DNF[24].

Figure 2.1: and/or Matrix (inputs a, b, c, d and the constants 0 and 1; outputs y0 and y1)
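One way to obtain a DNF is to read it from the truth table: every input combination for which the function is 1 contributes one conjunctive clause. The following Python sketch illustrates this for small functions; the helper names are hypothetical and the code is meant only as an illustration of the conversion, not as part of any design flow.

```python
from itertools import product


def dnf_minterms(func, n):
    """Return all input combinations of n bits for which func evaluates to 1."""
    return [bits for bits in product((0, 1), repeat=n) if func(*bits)]


def format_dnf(minterms, names):
    """Render the minterms as a disjunction of conjunctive clauses."""
    clauses = []
    for bits in minterms:
        # A variable appears negated when its bit in the minterm is 0.
        literals = [name if bit else "¬" + name for name, bit in zip(names, bits)]
        clauses.append("(" + " ∧ ".join(literals) + ")")
    return " ∨ ".join(clauses) if clauses else "0"


if __name__ == "__main__":
    # Exclusive or of two variables as an example function.
    print(format_dnf(dnf_minterms(lambda a, b: a ^ b, 2), ["a", "b"]))
```

Each clause of the result corresponds to one and-gate of the matrix, and the disjunction corresponds to one or-gate.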

a   b   S   Cout
0   0   0   0
0   1   1   0
1   0   1   0
1   1   0   1

Table 2.1: Truth table of a half adder

Figure 2.2: Half adder implemented in an and/or matrix (inputs a, b, c, d and the constants 0 and 1; outputs Cout and S)

Figure 2.2 displays an example implementation of a half adder with the truth table given in Table 2.1. The formulas for $S$ and $C_{out}$ can be read out of the truth table:

$S = (a \land \lnot b) \lor (\lnot a \land b)$,

$C_{out} = a \land b$.

Both are in DNF and can be implemented directly in an and/or matrix. The nodes in Figure 2.2 represent connections at the intersection points of the signals. Three forms of the matrix approach exist:

• Both the and-matrix and the or-matrix are programmable.

• Only the and-matrix is programmable; the or-matrix has a fixed programming.

• Only the or-matrix is programmable; the and-matrix has a fixed programming.

Different ICs use different variants of the matrix approach.
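The way a DNF maps onto the two matrices can be illustrated with a small software model. The following Python sketch is only an illustration under assumptions: the names `AND_MATRIX`, `OR_MATRIX`, `evaluate` and the `~` prefix for negated inputs are hypothetical and not taken from the thesis. Each row of the and-matrix lists the vertical lines connected to one and-gate, and the or-matrix lists the and-gates connected to each output.

```python
def literal_lines(inputs):
    """Vertical lines of the and-matrix: constants, inputs and negated inputs."""
    lines = {"0": False, "1": True}
    for name, value in inputs.items():
        lines[name] = bool(value)
        lines["~" + name] = not value
    return lines


def evaluate(and_matrix, or_matrix, inputs):
    """Evaluate a programmed and/or matrix for one input assignment."""
    lines = literal_lines(inputs)
    # One product term per and-gate: all connected lines must be 1.
    products = [all(lines[name] for name in row) for row in and_matrix]
    # One sum per or-gate: at least one connected product term must be 1.
    return {out: int(any(products[i] for i in rows))
            for out, rows in or_matrix.items()}


# Half adder in DNF: S = (a AND NOT b) OR (NOT a AND b), Cout = a AND b.
AND_MATRIX = [{"a", "~b"}, {"~a", "b"}, {"a", "b"}]
OR_MATRIX = {"S": {0, 1}, "Cout": {2}}


def half_adder(a, b):
    return evaluate(AND_MATRIX, OR_MATRIX, {"a": a, "b": b})


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, half_adder(a, b))
```

In this model, "programming" the matrix means nothing more than choosing the contents of the two connection tables, which corresponds to setting the switches at the intersection points.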

2.2 Multiplexer Approach

A multiplexer is a small digital selector device. It routes one of n input signals to its output. The number of input signals depends on the number of selection signals: with x selection signals, the multiplexer can process $2^x$ input signals. Figure 2.3 shows a 4 to 1 multiplexer with data inputs e0 ... e3 and selection inputs s0 and s1.

[Figure 2.3: 4 to 1 Multiplexer]

Simple Boolean functions $f : \mathbb{B} \times \mathbb{B} \to \mathbb{B}$ can be built out of this multiplexer by using s0 and s1 as the input variables and assigning each of the data inputs the corresponding result of the function. Table 2.2 shows how to implement the logic functions ∧, ∨ and ⊕ with a multiplexer.

e0  e1  e2  e3 | function
0   0   0   1  | f(s0, s1) = s0 ∧ s1
0   1   1   1  | f(s0, s1) = s0 ∨ s1
0   1   1   0  | f(s0, s1) = s0 ⊕ s1

Table 2.2: Different Boolean functions implemented with a 4 to 1 multiplexer

To make this approach reconfigurable to different Boolean functions, FlipFlops [...] functions can be configured. This pattern can be extended to implement functions of type $f : \mathbb{B}^n \to \mathbb{B}$ by cascading multiplexers. An example is given in Figure 2.4. There are two additional input variables available: s2 and s3. However, this pattern does not scale, because for every two additional input variables the required number of multiplexers quadruples. Another method to increase the number of input variables is to increase the number of selection signals, but this does not scale either, due to signal fanning: x selection signals require $2^x$ input signals.

[Figure 2.4: Cascaded 4 to 1 Multiplexer]

Functions of type $f : \mathbb{B}^n \to \mathbb{B}^m$ have to be split into m functions of type $f : \mathbb{B}^n \to \mathbb{B}$ to be implementable with the multiplexer pattern.
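The multiplexer pattern can be sketched in a few lines of Python. This is a behavioural model only; the function names `mux4` and `mux16` and the choice of s0 as the more significant selection bit are assumptions made for the illustration. The data-input patterns are those of Table 2.2, and the cascade follows the structure of Figure 2.4 with four first-stage multiplexers and one second-stage multiplexer.

```python
def mux4(e, s0, s1):
    """4 to 1 multiplexer: route one of e[0..3] to the output.

    Assumption: s0 is the more significant selection bit, so the
    data inputs are addressed as 00, 01, 10, 11.
    """
    return e[(s0 << 1) | s1]


# Data-input patterns (e0, e1, e2, e3) from Table 2.2.
CONFIG = {
    "and": (0, 0, 0, 1),
    "or": (0, 1, 1, 1),
    "xor": (0, 1, 1, 0),
}


def mux16(e, s0, s1, s2, s3):
    """Cascade as in Figure 2.4: four multiplexers share s0 and s1,
    a fifth one selects among their outputs with s2 and s3."""
    stage1 = [mux4(e[4 * i:4 * i + 4], s0, s1) for i in range(4)]
    return mux4(stage1, s2, s3)


if __name__ == "__main__":
    for name, pattern in CONFIG.items():
        table = [mux4(pattern, a, b) for a in (0, 1) for b in (0, 1)]
        print(name, table)
```

Going from two to four input variables already requires five multiplexers and sixteen data inputs, which illustrates why the cascade does not scale.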

2.3 Look Up Table Approach

A better solution to implement reconfigurable functions of type $f : \mathbb{B}^n \to \mathbb{B}$ is to use a small RAM or LUT. The address signals of the RAM are used as the input parameters and the data words hold the result of the function. Table 2.3 displays the implementation of the simple Boolean functions ∧, ∨ and ⊕ in a LUT with an address width of three and a data width of eight. Because only two operands are required for these operations, a1 and a2 are selected as the input variables. The results are encoded in the data word, starting from the leftmost bit for ∧.

The LUT approach supports the calculation of multiple functions of type $f : \mathbb{B}^n \to \mathbb{B}$ concurrently by using different bits of the data word as the results. This approach is better suited for the calculation of $f : \mathbb{B}^n \to \mathbb{B}^m$ functions than any other presented approach because it only requires one LUT, as long as m is less than or equal to the size of one data word. For functions with m greater than the size of one data word, LUTs [...]

a0  a1  a2 | Dataword (8 bit)
0   0   0  | 00000000
0   0   1  | 01100000
0   1   0  | 01100000
0   1   1  | 11000000
1   0   0  | 00000000
1   0   1  | 00000000
1   1   0  | 00000000
1   1   1  | 00000000

Table 2.3: Example LUT implementing ∧, ∨ and ⊕
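The contents of Table 2.3 can be checked with a short Python model of the LUT. The names `LUT`, `lookup` and `bit` are hypothetical; the sketch assumes that a0 is the most significant address bit and that bit position 0 denotes the leftmost bit of the data word, as in the table.

```python
# Data words of Table 2.3, indexed by the address (a0, a1, a2).
LUT = [
    0b00000000,  # 000
    0b01100000,  # 001
    0b01100000,  # 010
    0b11000000,  # 011
    0b00000000,  # 100
    0b00000000,  # 101
    0b00000000,  # 110
    0b00000000,  # 111
]


def lookup(a0, a1, a2):
    """Read the data word stored at address (a0, a1, a2)."""
    return LUT[(a0 << 2) | (a1 << 1) | a2]


def bit(word, position, width=8):
    """Return one bit of the data word; position 0 is the leftmost bit."""
    return (word >> (width - 1 - position)) & 1


if __name__ == "__main__":
    for a1 in (0, 1):
        for a2 in (0, 1):
            word = lookup(0, a1, a2)
            print(a1, a2, "and:", bit(word, 0), "or:", bit(word, 1), "xor:", bit(word, 2))
```

One lookup delivers all three function results at once, which is the property that makes the LUT suitable for functions with several output bits.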

2.4 Field Programmable Gate Arrays

To extend Boolean functions as explained in the previous sections to Finite State Machines (FSMs) or even more complex circuits, memory and interconnects are necessary.

Many ICs provide the required resources to configure digital circuits, such as FPGAs, PLAs, CPLDs and PALs. This section describes the general structure of FPGAs because they are used for the prototype system in this thesis. Many books provide this information; this section is based on the book by Urbanski et al. [25]. In contrast to its name, an FPGA is not an array of gates, but an array of configurable basic elements, such as Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM (BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the basic FPGA structure with CLBs and IOBs, and without interconnects. They are organised in an array structure to simplify the interconnection of the blocks. All components of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs.

[Figure 2.5: Simple structure of an FPGA without interconnects]

2.4.1 Input/Output Blocks

IOBs are the interface from the configured hardware to the input and output pins of the FPGA. The developer can configure them to support different voltage levels and input/output signal standards, such as Low-Voltage Transistor-Transistor Logic (LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic (HSTL).

2.4.2 Configurable Logic Blocks

CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays the structure of two CLBs. The switch matrix is already part of the FPGA's interconnection network. One CLB consists of two slices. These slices are tightly interconnected through carry lines to increase the operand size of Boolean functions. Pairs of CLBs are connected through a shift line to form large shift registers.

[Figure 2.6: Structure of two Virtex5 CLBs [3]]

Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs, four storage elements, wide-function multiplexers, and carry logic [3].

The LUTs have six independent inputs and two independent outputs. This structure supports the configuration of one Boolean function of type $f : \mathbb{B}^6 \to \mathbb{B}$ or two Boolean functions of type $f : \mathbb{B}^5 \to \mathbb{B}$ if the two functions share the same input parameters. Three multiplexers are connected to the four LUTs in one slice to support combining two LUTs to increase the number of possible inputs to seven or eight. Functions with more inputs are implemented by combining slices.

D-type FFs provide storage functionality within each slice. Their input can be driven directly from a LUT. Some special slices provide more storage capacity by merging LUTs [...]

2.4.3 Block RAM

FPGAs support BRAM to provide reconfigurable hardware with fast and area-inexpensive RAM. On Xilinx FPGAs, BRAM is provided in 36 Kbit blocks. They are placed in columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5 devices the available BRAM ranges from 144 kbytes up to 2321 kbytes.

BRAM can be used as single port RAM, as dual port RAM, or as First In First Out (FIFO) queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues, reducing the space requirements of the reconfigurable hardware. Access times for BRAM are very short compared to off-chip Double Data Rate (DDR) RAM. A data word is available one clock cycle after issuing the address to the RAM, making it a good choice for fast buffers or caches.

2.4.4 Special IO Components

Often, reconfigurable hardware requires special I/O components, such as Ethernet, Serial Advanced Technology Attachment (SATA), or PCI. Implementing these I/O components in reconfigurable hardware is possible, but requires much FPGA space. Therefore, FPGAs provide some special non-reconfigurable I/O hardware. This hardware implements common parts of I/O devices, which can be used to create the required components. The Virtex5 FPGA family provides Ethernet MACs and RocketIO GTP Transceivers.

Ethernet MACs reduce the area usage for Ethernet devices because they implement the Media Access Control (MAC) layer of the Ethernet protocol.

RocketIO GTP Transceivers provide general components for high-speed serial I/O, such as 8b/10b encoders/decoders and fast serialisers and deserialisers. These transceivers can be used to implement the physical layer of the PCI or SATA bus. The correct working mode can be set through special instructions in the Hardware Description Language (HDL).

2.4.5 Interconnection Network

The interconnection network and the CLBs are the most important parts of the FPGA. Without the interconnection network the CLBs cannot be combined and larger components cannot exchange data. FPGAs distinguish three different signal types, which have to be routed through the interconnection network with different priorities and signal latencies.

clock signals: Clock signals require a fast distribution throughout the FPGA because all components are synchronised to their rising or falling edge.

reset signals: Reset signals are similar to clock signals. Through reset signals, components are initialised at the same moment. This also requires a fast distribution throughout the FPGA.

I/O signals: For I/O signals a fast distribution is also important, but the maximum [...]

Another important requirement for I/O signals is their number. A normal design only has around one to three different clock signals and as many reset signals, but the number of I/O signals is very large.

Therefore, FPGAs provide two different interconnection networks: one for clock and reset signals and one for all the I/O signals required to exchange data between components.

2.5 Partial Reconfiguration

PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs [2]. It extends the normal configuration capability of FPGAs with the ability to modify parts of a running configuration without interrupting the computation.

The design is divided into a static and a reconfigurable part during development. For the static part, special entities, called reconfiguration modules, are defined, which hold the reconfigurable components. This definition includes a signal interface declaration for communicating with the static part. There can be different reconfiguration modules in one design, each with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component that should be configurable into one module.

[Figure 2.7: Simple PR example [2]: an FPGA with static logic and two reconfiguration modules, RM0 and RM1, with the partial configuration files RM00.bit to RM03.bit and RM10.bit to RM13.bit]

The synthesis process creates several FPGA configuration files. The main file includes the static design and a component for each instance of a reconfiguration module. For every component and every instance an additional partial configuration file is created. These files can be loaded into the FPGA after the main file to reconfigure certain reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable system. It features two reconfiguration module instances and four partial configuration files per module. Instances can only be configured into the RMs for which they have [...]
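The relationship between reconfiguration modules and partial configuration files in Figure 2.7 can be modelled as simple bookkeeping. The following Python sketch is not a Xilinx API and does not touch any hardware; the class `ReconfigurableSystem` and its method names are hypothetical. It only captures the rule that a partial configuration file can be loaded solely into the RM it was generated for.

```python
# Partial configuration files per reconfiguration module, as in Figure 2.7.
PARTIAL_FILES = {
    "RM0": {"RM00.bit", "RM01.bit", "RM02.bit", "RM03.bit"},
    "RM1": {"RM10.bit", "RM11.bit", "RM12.bit", "RM13.bit"},
}


class ReconfigurableSystem:
    """Tracks which partial configuration is currently loaded in each RM."""

    def __init__(self, partial_files):
        self.partial_files = partial_files
        # After loading the main file no partial configuration is recorded.
        self.loaded = {rm: None for rm in partial_files}

    def reconfigure(self, rm, bitfile):
        """Load a partial configuration file into one RM.

        A file generated for another RM is rejected.
        """
        if bitfile not in self.partial_files[rm]:
            raise ValueError(f"{bitfile} was not generated for {rm}")
        self.loaded[rm] = bitfile


if __name__ == "__main__":
    system = ReconfigurableSystem(PARTIAL_FILES)
    system.reconfigure("RM0", "RM02.bit")
    system.reconfigure("RM1", "RM10.bit")
    print(system.loaded)
```

The static part keeps running while `reconfigure` is called for one module; in this model that is reflected by the fact that the entry of the other module is left untouched.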

3 Example Reconfigurable Systems

3.1 Research Systems

3.1.1 RampSoC

A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during run-time by exploiting dynamically and partially reconfigurable hardware [4]. A special design flow is used, which combines the top-down and the bottom-up approach. The bottom-up approach is used during design time to set up the basic conditions of a RampSoC according to the problem space it should be used in. In the top-down approach the software is optimised for this initial setup. Parts of this initial setup can be reconfigured to meet arising needs of applications during runtime, such as a different processor core or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at some point in time. Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.

[Figure 3.1: Possible RampSoC configuration on an FPGA, with microprocessors of type 1 and type 2, accelerators, and switches]

The implementation of a RampSoC uses the early access PR concept of Xilinx. This design flow is no longer supported by the Xilinx toolchain. The early access PR design flow requires that reconfigurable modules are defined before synthesis of the project. To reconfigure different cores, accelerators and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core requires its own reconfiguration module or whether the biggest core size is selected as the size for the reconfiguration unit, balancing space exploitation against flexibility. The RampSoC approach uses proprietary processor cores, such as PicoBlaze and MicroBlaze cores from Xilinx. Accelerator units are connected to these cores; they can change their hardware function while the processor is executing a program.

The RampSoC approach is a very flexible improvement compared to normal multicore processors or MPSoCs. Its heterogeneous structure allows the optimal execution of applications with different hardware requirements and can adapt to application needs during runtime very easily. Processor cores can even be replaced by special FSMs supporting calculations in special hardware components.

3.1.2 PRHS

The Partial Reconfiguration Heterogenous System (PRHS), developed by Eckert [5], also tries to exploit the newly available space on ICs through reconfiguration. The PRHS is a softcore SoC, configured onto an FPGA. It features one RM of the Xilinx PR design flow. Different hardware components can be configured into this RM. The RM can accelerate computations on the SoC, but its main purpose is virtualisation.

Virtualisation in this case means the instantiation of a full SoC running under the supervision of the static core. The virtualised SoC also runs Linux as its OS. Figure 3.2 displays this scenario. The static system on the right runs Linux as its OS. It has full access to memory and memory-mapped I/O hardware components like Universal Asynchronous Receiver/Transmitters (UARTs) or timers. On the left, an RM is available and connected to the static system. The SoC configured at runtime into this RM has only partial access to the memory. The accessible memory space is configured from the static system before the virtualised system is started. A memory-mapped I/O component interconnects the RM and the static system. It supports starting and stopping the virtualised system, but not suspending it. Providing a virtualised hard disk to the reconfigurable system is another feature of the static system.

[Figure 3.2: Block diagram of the PRHS, showing the static system (processor, instruction and data caches, bus controllers, system arbiter, timers, UARTs, interrupt controller, boot RAM) and the reconfigurable module attached through the PRHS SD bus, the reconfiguration arbiter, and the ICAP interface]

The PRHS is an interesting way of using tightly coupled reconfigurable hardware from a static processor core. The virtualised processor cores can feature different ISAs and [...]

3.1.3 Dreams

Dreams is not itself an RS, but a tool to build runtime reconfigurable systems. It processes Xilinx Description Language (XDL) files, created by the Xilinx tools, and provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow forces the developer to run the synthesis, place, and route process for every RM and every implementation of a module, the Dreams design flow does not. It supports easy relocation of RMs that have been synthesised, placed and routed only once.

XDL is a human-readable language for describing netlists. It is compatible with the ncd netlist file format, and Xilinx provides programs for easy conversion.

Dreams is developed by Otera et al. [26]. It tries to improve the Xilinx design flow in four different ways:

1. Module relocation in any compatible region in the device

2. Independent design of modules and the static system

3. Hiding low level details from the designer

4. Enhanced module portability among different reconfigurable devices

Its design flow targets reconfigurable architectures built out of disjoint rectangular regions.

The system architecture enforced by the Dreams tool is divided into Virtual Regions (VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as an RM or a static module. The VA describes the full system, including static and reconfigurable parts and how they are interconnected using the FPGA's interconnect. The developer provides the VR and VA descriptions as Extensible Markup Language (XML) files.

Dreams is a very interesting tool. Very large reconfigurable systems suffer from very long placement and routing times in the Xilinx PR design flow. Dreams could significantly reduce these times and improve the development time of such systems.

3.2 Commercial Systems

3.2.1 Convey HC1

One commercially available RS is the Convey HC1 [6]. It combines four Xilinx Virtex5 FPGAs with an Intel Xeon processor through the x86 co-processor interface. Figure 3.3 gives an overview of this architecture. The system contains two memories, one connected to the processor cores and another one connected to the four FPGAs. Both are accessible from the processor and the FPGA side. Hardware ensures cache coherency between them. The memory on the FPGA side is specially partitioned to support concurrent access to different memory banks from different FPGAs to increase the overall memory [...]

[Figure 3.3: Overview of the Convey HC1 architecture [6]]

Communication with the FPGAs is implemented using the coprocessor interface of Intel processors. Software running on the Xeon processor can trigger hardware operations running on one of the FPGAs by issuing special coprocessor instructions and writing the data required for the operation to special memory regions. Programs can change configurations in idle times of the FPGA. The Xilinx PR design flow is available in principle, but is not yet supported by Convey, which results in long reconfiguration latencies and very fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high performance computing. In high performance computing the accelerator hardware seldom changes and one important factor is memory access. Memory access is very fast on the HC1 because of its special memory layout.

3.2.2 Intel Stellarton

Another commercial RS is the Intel Stellarton processor and FPGA SoC [14]. It combines a standard Intel Atom processor core with an Altera FPGA on the same chip, but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC contains all the standard components of the Intel Atom processor, like the DDR interface, graphics adaptor/accelerator, audio component and Peripheral Component Interconnect Express (PCIe) bus interface.

The Altera FPGA [27] is connected to the processor by this PCIe bus. Through this bus the FPGA is configurable and application data can be exchanged between FPGA and processor. The main purpose of this RS was to improve the performance of host programs by accelerator hardware.

The production of the system has been discontinued, but a new approach by Intel seems to be on its way. According to Diane Bryant [28], Intel is working on combining their Xeon server processors with FPGAs to improve the performance of [...]

[Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA]

3.2.3 Xilinx Zynq Architecture

Zynq [7] is a recent hybrid hardware system produced by Xilinx. It features a dual ARM Cortex A9 processor core connected to many peripherals and an FPGA through an Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall system structure. Processor core and FPGA share the same chip, but not the same die, like the Intel Stellarton processor. It provides many static hardware components to connect to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller, a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN) controller. The AMBA bus, which connects the FPGA to the processor, is very common in embedded devices. It supports general-purpose ports and high performance ports from the processor to the FPGA. The FPGA has access to high speed serial I/O transceivers going off-chip and to the AMBA bus. All other features of a Virtex7 FPGA are also supported, including PR.

The Zynq architecture is an interesting system for embedded hardware developers. A standard embedded OS can run on the ARM processor cores, and the FPGA can improve calculation performance for special applications, like audio and video editing, radio transmissions, and cryptographic algorithms.


3.3 COPACOBANA and RIVYERA

The COPACOBANA and RIVYERA systems developed by SciEngines are hybrid hardware systems optimised for cryptanalysis and scientific computing.

Both systems consist of many interconnected FPGAs working together to solve a problem. The host system is connected through 10 Gbit Ethernet cards, 4 Gb Fibre Channel cards, or InfiniBand. The COPACOBANA can try the complete 56-bit DES key space within 12.8 days. The RIVYERA is the further development of the COPACOBANA.

[Figure 3.6: COPACOBANA and RIVYERA interconnection overview]
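The quoted search time can be turned into an aggregate key rate with a short calculation. The following Python sketch uses only the two figures stated above (a 56-bit key space and 12.8 days); the function names are hypothetical, and the result is a derived estimate, not a number reported by the vendor.

```python
KEY_BITS = 56
FULL_SEARCH_DAYS = 12.8
SECONDS_PER_DAY = 24 * 60 * 60


def keys_per_second(key_bits=KEY_BITS, days=FULL_SEARCH_DAYS):
    """Aggregate key rate needed to test the whole key space in the given time."""
    return 2 ** key_bits / (days * SECONDS_PER_DAY)


def average_search_days(days=FULL_SEARCH_DAYS):
    """On average a key is found after searching half of the key space."""
    return days / 2


if __name__ == "__main__":
    print(f"{keys_per_second():.3e} keys per second")
    print(f"{average_search_days():.1f} days on average")
```

Under these assumptions the machine has to test roughly 6.5 * 10^10 keys per second in total, and a randomly chosen key is expected to be found after about 6.4 days.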

4 Interconnection Networks

Modern hardware design often requires the development of several interconnected components. Different interconnection network schemes are available today. If more tightly coupled systems are required, these components are combined on a single chip. Such a tightly connected system is called an SoC.

Figure 4.1 displays an example mobile phone system with three different interconnection schemes. This system can be developed as a multi-chip system or as an SoC. The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a radio transceiver. These components interact in different ways to get the mobile phone running. The interactions can be implemented using different kinds of interconnection networks. Figure 4.1 shows three possible topologies. In a) all components are connected to a bus, with the typical bus communication restrictions, such as exclusive bus access for a single component and poor scalability. In b) each component is directly connected to all components it interacts with. This network topology supports very flexible communication, but requires many interconnection links. The last displayed topology, c), is a packet-switched network built out of the components and switches. Networks of this kind are called NOCs. NOCs are very similar to the communication infrastructure of inter-computer networks, such as Local Area Networks (LANs) or Wide Area Networks (WANs).

[Figure 4.1: Example mobile phone System-on-Chip (SoC) with a) bus connection, b) P2P connection, and c) NOC connection]

Many more network architectures exist. To distinguish these networks and to highlight their differences and performance properties, a classification is necessary. In this work, part of the classification by Schwederski et al. [21] is used, which is [...]

A classification is usually based on a mathematical representation of the entity of interest. In this case, finite graphs are a good representation of interconnection networks. The edges of the graph model the interconnection links and the nodes are the Processing Elements (PEs) connected to the network. A PE is a component that performs calculations and uses the network for communication purposes, such as a processor core, a DSP, or some other kind of device controller.
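The graph view can be made concrete for the mobile phone example of Figure 4.1. The Python sketch below is an illustration under assumptions, not the classification of Schwederski et al.: the bus is modelled as one extra node shared by all PEs, the P2P case is taken as a fully connected graph (the upper bound on the number of links), and the assignment of PEs to the two switches is invented for the example.

```python
from itertools import combinations

PES = ["CPU", "Memory", "DSP", "RF", "Keypad"]


def bus(pes):
    """All PEs attached to one shared medium, modelled as an extra node."""
    return {frozenset((pe, "bus")) for pe in pes}


def fully_connected(pes):
    """One direct link between every pair of PEs."""
    return {frozenset(pair) for pair in combinations(pes, 2)}


def noc(attachment, switch_links):
    """PEs attached to switches, plus links between the switches."""
    edges = {frozenset((pe, switch)) for pe, switch in attachment.items()}
    edges |= {frozenset(link) for link in switch_links}
    return edges


# Hypothetical placement of the PEs on two switches.
ATTACHMENT = {"CPU": "S0", "Memory": "S0", "DSP": "S1", "RF": "S1", "Keypad": "S0"}

if __name__ == "__main__":
    print("bus links:", len(bus(PES)))
    print("P2P links:", len(fully_connected(PES)))
    print("NOC links:", len(noc(ATTACHMENT, [("S0", "S1")])))
```

For n PEs the fully connected graph needs n(n-1)/2 links, while the bus and the switched network grow linearly with n, which is the scaling argument behind the comparison of the three topologies.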

This chapter is organised as follows: Section 4.1 describes the OSI model, an industry-standard model for communication protocols that simplifies their development. The distinguishing characteristics of NOCs are explained in Section 4.2 to Section 4.8.

4.1 Open Systems Interconnection Model

Communication systems mostly consist of more than just two communication partners. These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach its destination, and the underlying infrastructure can differ from node to node because of different responsibilities. The transmitted data can be divided into a header, containing source and destination addresses, payload size, and quality of service information, and the actual payload. The position of the header data and the payload has to be defined so that every developer and manufacturer can produce compatible hardware. Later in this work, protocols will be described using the terminology of the OSI model.

The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) [29] developed the OSI model to simplify the definition of communication protocols. Seven functionally distinct layers divide the communication process. Figure 4.2 gives a graphical representation of these layers and the expected protocol flow. The flow starts at either side of the network stack. If data is to be transmitted to another communication partner, the communication usually starts at the application layer. Every layer processes the data and passes it down to the next layer until it reaches the physical layer. Each layer adds header information or transforms the data according to the network requirements. Sometimes control messages are created, passed down the layers and sent to the corresponding layer at the next communication partner, to create a virtual connection between them.

The physical layer transmits the data through some kind of medium (wire, air, fibre optic, ...) to the next node. After the transmission, the data passes up the layers. If the node is just an intermediate one, the data moves up to the network layer, where it gets formatted for the transmission to the next node. If the data has arrived at its destination, it gets passed up to the application layer.
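The flow through the layers can be sketched as nested headers. The following Python model is a strong simplification made for illustration: every layer above the physical layer only prepends a textual tag, real protocols add binary headers and may also transform the data, and the function names `send`, `receive` and `forward` are hypothetical.

```python
LAYERS = [
    "application", "presentation", "session", "transport",
    "network", "data link", "physical",
]
# The physical layer only transmits bits, so it adds no tag in this model.
HEADER_LAYERS = LAYERS[:-1]


def send(payload):
    """Pass the payload down the stack; each layer prepends its header."""
    frame = payload
    for layer in HEADER_LAYERS:
        frame = f"<{layer}>" + frame
    return frame


def strip_header(frame, layer):
    """Remove the header of one layer, which must be the outermost one."""
    tag = f"<{layer}>"
    if not frame.startswith(tag):
        raise ValueError(f"expected {layer} header")
    return frame[len(tag):]


def receive(frame):
    """Destination node: pass the frame up to the application layer."""
    for layer in reversed(HEADER_LAYERS):
        frame = strip_header(frame, layer)
    return frame


def forward(frame):
    """Intermediate node: go up to the network layer and down again."""
    packet = strip_header(frame, "data link")
    # A real network layer would select the next node here.
    return "<data link>" + packet


if __name__ == "__main__":
    frame = send("hello")
    print(frame)
    print(receive(forward(frame)))
```

The header added last, by the data link layer, is the outermost one and therefore the first to be removed at the next node, which is why an intermediate node never has to look at the headers of the upper layers.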

In the following sections each of the seven layers is briefly described. More information [...]

[Figure 4.2: Graphical representation of the ISO/OSI model: two network stacks (application, presentation, session, transport, network, data link, and physical layer) exchanging data through the physical transmission of bits]

4.1.1 Application Layer

The application layer is the interface between a program or application running on a PE and the communication infrastructure. It defines the interaction between two or more communication partners, such as how to request data or how to send data to the partner. For this interaction the application does not require any information about the underlying network; the destination address is enough. Very common application layer protocols used in the Internet are the Hypertext Transfer Protocol (HTTP) and the Post Office Protocol Version 3 (POP3).

4.1.2 Presentation Layer

Data can be presented in multiple forms. For example, processor cores use either big endian or little endian byte order for working with structures bigger than one byte. A higher-level form is the character encoding, for example with ISO codes or UTF-8.

To allow the application layer to simply use the passed data, the presentation layer converts and transforms the data to the required representation.

The presentation layer can also be used to implement point-to-point encryption.
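Both kinds of conversion mentioned above can be demonstrated with the Python standard library. The sketch below is only an example: the choice of big endian as the common transfer format and the helper names `to_wire`, `from_wire` and `transcode` are assumptions made for the illustration.

```python
def to_wire(value, width=4):
    """Encode an unsigned integer in big endian byte order."""
    return value.to_bytes(width, "big")


def from_wire(data):
    """Decode a big endian byte sequence back into an integer."""
    return int.from_bytes(data, "big")


def transcode(data, source, target):
    """Convert text from one character encoding to another."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    value = 0x12345678
    print(value.to_bytes(4, "little").hex())  # byte order of a little endian core
    print(to_wire(value).hex())               # byte order used for the transfer
    print(transcode(b"\xfc", "latin-1", "utf-8"))
```

If a little endian sender transmitted its bytes unchanged and the receiver interpreted them as big endian, 0x12345678 would arrive as 0x78563412; agreeing on one representation in the presentation layer avoids this.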

4.1.3 Session Layer

A communication session consists of the connection establishment, the transmission and reception of multiple data items, and the teardown of the connection.