Virtualizing FPGAs for Cloud Computing Applications. Stuart A. Byma

Virtualizing FPGAs for Cloud Computing Applications

by

Stuart A. Byma

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

Abstract

Virtualizing FPGAs for Cloud Computing Applications Stuart A. Byma

Master of Applied Science

Graduate Department of Electrical and Computer Engineering University of Toronto

2014

Cloud computing has become a multi-billion dollar industry, and represents a computing paradigm where all resources are virtualized, flexible and scalable. Field Programmable Gate Arrays (FPGAs) have the potential to accelerate many cloud-based applications, but as of yet are not available as cloud resources because they are so different from the conventional microprocessors that virtual machines (VMs) are based on. This thesis presents a first attempt at virtualizing and integrating FPGAs into cloud computing systems, making them available as generic cloud resources to end users. A novel architecture enabling this integration is presented and explored, and several custom hardware applications are evaluated on a prototype system. These applications show that Virtualized FPGA Resources can significantly outperform VMs in certain classes of common cloud computing applications, showing the potential to increase user compute power while reducing datacenter power consumption and operating costs.

Dedication

To Jennifer.

Acknowledgements

First I must sincerely thank my advisors Professors Greg Steffan and Paul Chow. I owe my success in graduate school to them and their invaluable guidance and advice. I could not have asked for better mentorship throughout my Masters research, or in this chapter of my life. Thank you both.

Also to my esteemed colleagues and office mates Xander Chin, Charles Lo, Ruedi Willenberg, Robert Heße, Fernando Martin Del Campo, Andrew Shorten, Jimmy Lin and others: You have made my time at the U of T a true pleasure – thank you for all the good times, and for being an ever present sounding board for thoughts and ideas.

A special thanks as well to members of the SAVI testbed: Professor Alberto Leon-Garcia, Hadi Bannazadeh, Thomas Lin and Hesam Rahimi. Your help and advice, technical and otherwise, have made my work presented here possible.

Finally and most importantly, an everlasting thanks to my wife. Thank you for encouraging me to pursue my passions, and thank you for your unfaltering belief in me. None of this would have happened without you.

Acronyms

BRAM Block Random Access Memory.

CAD Computer Aided Design.

CAM Content Addressable Memory.

DRAM Dynamic Random Access Memory.

FIFO First In First Out.

FPGA Field Programmable Gate Array.

GPIO General-Purpose Input-Output.

LUT Look-up Table.

MAC Media Access Control.

PR Partial Reconfiguration.

PRM Partially Reconfigurable Module.

PRR Partially Reconfigurable Region.

SAVI Smart Applications on Virtual Infrastructure.

TCP Transmission Control Protocol.

UART Universal Asynchronous Receiver-Transmitter.

UUID Universally Unique Identifier.

VFR Virtualized FPGA Resource.

VM Virtual Machine.

List of Tables

4.1 Resource Usage for System Static Hardware

5.1 Boot Times for VMs and VFRs

5.2 Resource Usage for VFR Load Balancer

5.3 Resource Usage for VFR VXLAN Port Firewall

5.4 Throughput and Latency for VXLAN Port Firewall

List of Figures

2.1 Diagram of the SAVI testbed.

2.2 The SAVI testbed Smart Edge node.

2.3 The Driver-Agent abstraction used in the SAVI testbed OpenStack system.

3.1 A simplified view of resource management in OpenStack/SAVI Testbed.

3.2 FPGA Partial Reconfiguration.

3.3 System view of the on-FPGA portion of the virtualization hardware.

3.4 Virtualization hardware input arbiter block.

3.5 The VFR wrapper design.

3.6 VFR boot sequence.

3.7 Compile flow.

4.1 A sequence diagram of the entire boot procedure in the SAVI testbed prototype system.

5.1 Experiment setup for load balancer tests.

5.2 VFR load balancer latency at different throughput levels.

5.3 VM load balancer latency at different throughput levels.

5.4 Number of dropped packets for the VM load balancer.

5.5 Packet diagram for the VXLAN protocol.

5.6 VXLAN Port Firewall.

5.7 Experimental setups for VFR-based VXLAN firewall.

Chapter 1 Introduction

Datacenter-based cloud computing has evolved into a multi-billion dollar industry, with continued growth forecast [1]. Cloud computing is based on virtualization technology, which abstracts physical resources into virtualized resources. This virtualization provides flexibility and system scalability (or elasticity) [2], and also allows many users to share available resources in a datacenter in a transparent way. Cloud computing can also greatly reduce Information Technology (IT) operating costs of companies and organizations [3], making it a very attractive option for IT needs.

1.1 Motivation

Field-Programmable Gate Arrays (FPGAs) have the potential to accelerate many common cloud computing and datacenter-centric applications, such as encryption [4], compression [5], or low-level packet processing [6]. FPGAs have begun to make their way into datacenters, and their use in this context can be organized into three categories. The first sees FPGAs being used in the technology that enables the datacenter itself, such as switches and routers. FPGAs in this category are transparent, as neither the end user nor the datacenter operator is necessarily aware of their existence. In the second category, FPGAs are used in special “appliances” – essentially boxes that accelerate certain tasks or processing. An example could be FPGA-based Memcached appliances [7]. The appliance may be available to end users, but the FPGAs inside are themselves not accessible, programmable resources – they are still relatively transparent. The third category, which is the focus of this thesis, sees FPGAs becoming fully user-accessible, programmable resources. Users would be able to allocate FPGA resources just as they would a virtual machine, using the same control infrastructure – making FPGAs first-class citizens of the cloud.

Consider a motivating example: a large organization runs its Information Technology (IT) services and website on an infrastructure-as-a-service cloud, using hundreds or even thousands of VMs to serve its site and services to millions of users. Its applications may require compute-intensive processing, or application-level packet processing that requires many VMs to do efficiently. If user-accessible FPGA resources are available in the cloud, the organization could design custom hardware to accelerate these tasks – eliminating a number of VMs in exchange for a few FPGA resources, and potentially gaining a boost in throughput and a reduction in latency. At the same time, the user retains all the benefits of using a compute cloud, such as dynamic scalability, flexibility, and reliability. The cloud provider also benefits by freeing up VMs, which could potentially reduce power consumption and operating costs.

The work presented in this thesis aims to explore methods of enabling these FPGA resources in commercial cloud computing systems.

1.2 Contributions

A hardware/software architecture enabling the virtualization of FPGAs and management thereof using the OpenStack cloud system is presented in this thesis. The major contributions are outlined as follows:

• A hardware and software infrastructure that splits an FPGA into a number of reconfigurable regions, and allows these regions to be managed as individual resources in an OpenStack cloud system. Introduction of the term Virtualized FPGA Resource (VFR).

• A functional implementation of such an architecture using the Smart Applications on Virtual Infrastructure (SAVI) testbed.

• A comparison of VFRs and VMs in terms of boot time performance.

• An evaluation and proof of concept of the prototype system by means of two applications:

– A hardware load balancer using a hypothetical UDP-based protocol

– A method of using virtualized hardware to extend capabilities in an OpenFlow software-defined network (SDN).

1.3 Overview

The rest of this thesis is organized as follows: Chapter 2 will provide background and context – reviewing prior work in virtualizing FPGAs or work similar to the techniques used in this thesis. Chapter 3 will introduce the hardware and software architecture enabling FPGA virtualization. Chapter 4 will introduce the SAVI [8] testbed prototype [9] and implementation details. Comparisons of VMs and VFRs, as well as evaluations of the proof of concept applications, are shown in Chapter 5. Chapter 6 provides some future vision and concludes the thesis.

Chapter 2 Background

This chapter introduces concepts and definitions used throughout this thesis, and provides context. Work related to the techniques used in this thesis will also be examined.

2.1 Field-Programmable Gate Arrays

This thesis focuses on the use of Field-Programmable Gate Arrays (FPGAs) in datacenters and cloud computing. An FPGA is a silicon chip whose functionality can be reprogrammed an arbitrary number of times to become nearly any digital circuit – it is a type of reconfigurable hardware. Modern FPGAs are typically made up of a large array of programmable Look-up Tables (LUTs), each of which can implement a four-, five- or six-variable logic function, depending on the device architecture. LUTs are usually coupled with flip-flops and organized into logic blocks that can then be connected together through a dense, programmable routing fabric. For further reading on FPGA architectures, the reader is directed to [10]. A set of CAD tools can map arbitrary hardware designs described in Hardware Description Languages (HDLs) to the FPGA fabric. The most common HDLs include Verilog HDL [11] and VHDL [12], but others such as BlueSpec Verilog [13] are gaining popularity.
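To make the LUT idea concrete, the following minimal Python sketch models a k-input LUT as a table of 2^k configuration bits that is simply indexed by the input values. The class name `LUT` and the helper `from_function` are illustrative assumptions and do not correspond to any vendor tool or device primitive.

```python
from itertools import product


class LUT:
    """Model of a k-input look-up table: 2**k configuration bits, one per input pattern."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, k, func):
        # "Program" the LUT by evaluating func on every input combination.
        bits = [int(bool(func(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def evaluate(self, *inputs):
        # The inputs form an address into the configuration bits (first input is the MSB).
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (1 if bit else 0)
        return self.config_bits[address]


# Example: a 4-input LUT configured as (a AND b) XOR (c OR d).
lut = LUT.from_function(4, lambda a, b, c, d: (a & b) ^ (c | d))
```

Reprogramming the device amounts to writing different configuration bits, which is why the same LUT can realize any logic function of its inputs.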

Modern FPGAs also have embedded hard blocks to increase their capabilities – these include Digital Signal Processor (DSP) or multiplier blocks, block Random Access Memories (BRAMs), high speed serialize-deserialize (SERDES) transceivers, communication controllers (Ethernet [14], PCIe [15]), and even full microprocessors.

2.2 Cloud Computing

It is useful to define what is meant by the term cloud computing, as many companies, individuals and other sources may use the term in different ways. This thesis will follow the definition of cloud computing given by NIST that describes several essential characteristics [16], summarized here:

1. On Demand Self Service: ability to provision cloud resources at any time, on demand, without interaction with humans.

2. Broad Network Access: all resources are available and accessible over the network.

3. Resource Pooling: provider resources are organized into pools enabling multi-tenant service.

4. Rapid Elasticity: amount of resources can be dynamically increased or contracted.

5. Measured Service: ability to monitor, control and report resource usage.

In addition there are also several different cloud service models. The NIST definition above covers all these models, and they are described briefly here:

• Software as a Service (SaaS): Allocatable resources are software programs, usually provided over the Internet via web browsers.

• Platform as a Service (PaaS): Resources are operating systems, development tools and frameworks for creating software and services.

• Infrastructure as a Service (IaaS): Resources are virtualized datacenter components such as Virtual Machines (VMs), virtual storage, networking, and bandwidth.

This thesis will focus primarily on IaaS type cloud computing, where allocatable resources are virtualized datacenter components. A good example of this type of cloud computing would be Amazon Web Services [17], or the SAVI testbed, which will be discussed shortly.

The cloud computing paradigm has become immensely popular in the IT services and related industries because it frees organizations from the physical aspects of computing and IT infrastructures. There is no major capital investment required for physical servers and networking equipment, nor maintenance costs on said equipment. These burdens are shifted to the Cloud Provider, and the IT organization simply pays a set rate for the cloud-based resources that it uses. The fact that the organization pays only for what it uses, combined with the lack of capital investment, represents a significant cost reduction to the organization. The cloud provider generally guarantees a Service Level Agreement (SLA), leaving the organization assured that its IT infrastructure will experience little to no downtime due to hardware problems. Cloud computing also allows the end user to scale their systems up or down seamlessly, avoiding the need to over-provision computing capabilities or bandwidth to mitigate the effects of bursty traffic, again saving on operating costs. From a technical and cost perspective, cloud computing is generally extremely attractive to organizations with both large and small IT needs. Certain other factors may influence the attractiveness of cloud computing, usually legal issues arising from the geographic location of the cloud provider’s datacenter, or privacy concerns due to the fact that a user’s data is effectively in the hands of a third party. However, these points are outside the scope of this thesis.

2.3 The Smart Applications on Virtual Infrastructure Network

SAVI [8] is a Canada-wide research network aimed at exploring next-generation application marketplaces that make use of fully virtualized infrastructure, as well as future Internet alternatives. A central vision of the SAVI network is the notion of a Smart Edge node – a smaller-scale datacenter situated close to the network edge, providing specialized low-latency processing for future application platforms. SAVI joins a number of other networking research testbeds such as GENI [18], Emulab [19], PlanetLab [20] and Internet2 [21], many of which are also federated.

[Figure 2.1 labels: SAVI Testbed Network; U of T Core; Edge nodes at U of T, McGill, Carleton, Victoria, Calgary, Waterloo and YorkU; C & M; ORION; CANARIE; Virtual Network; Application X Resources; Application Y Resources.]

Figure 2.1: Diagram of the SAVI testbed.

The SAVI testbed is one of the SAVI network research themes. The goal of the SAVI testbed is to realize a future application platform that will provide a testing ground for other SAVI research themes. The testbed consists of several Core nodes and many Smart Edge nodes, deployed at various universities and institutions across Canada. These Core and Edge nodes are interconnected by a fully virtualized Software Defined Network (SDN), and the whole system is orchestrated by a Control and Management (C & M) system. Users can allocate virtualized resources via the C & M system across all nodes in the testbed, as well as private virtual networks that provide complete isolation from other users’ experiments and systems. The ORION [22] and CANARIE [23] networks connect all components over a large geographic area of Canada. Figure 2.1 shows a diagram of the testbed architecture.

2.3.1 The Smart Edge Node

SAVI Smart Edge nodes are small-scale datacenters situated close to the edge of the network and are the primary connection point for application users. SAVI Smart Edge nodes are unique in that they make use of heterogeneous resources in addition to virtual machines to accelerate processing – GPUs and reconfigurable hardware, as well as regular bare-metal servers. These resources put a large amount of processing power close to the edge of the network, and allow applications to do a majority of intensive processing before having to traverse the possibly high-latency network to the Core datacenter. Such intensive processing may include things like advanced signal processing for wirelessly connected devices, encryption and decryption, multimedia streaming acceleration, new types of switching and routing, and other packet-oriented processing.

Figure 2.2 shows a diagram of a Smart Edge node. In the SAVI testbed, the Smart Edge is an OpenStack cloud system [24]. OpenStack management forms the Smart Edge C & M plane through a number of subsystems, all of which are reachable through RESTful [25] APIs.

• Nova [26] – manages all compute resources through a Driver-Agent abstraction.

• Keystone [27] – performs authentication and identity management.

• Glance [28] – manages Virtual Machine images and other images.

• Quantum [29] – performs network management functions. Due to naming trademarks, it will become known as “Neutron”.

• Cinder [31] – block storage service.

The SAVI Smart Edge also has a custom Software-Defined Infrastructure (SDI) manager, called Janus. Janus offloads certain tasks from OpenStack, such as network control and resource scheduling, and also performs configuration management and orchestration of the testbed’s OpenFlow-based Software-Defined Network (SDN). Network control is accomplished through an OpenFlow Controller implemented using Ryu [32]. Janus also virtualizes the network into slices using FlowVisor [33] (an OpenFlow-based network virtualization layer), and users can run their own User OpenFlow Controller to manage their own private network slice. Essentially, the SDI Manager brings Cloud Computing and Software Defined Networking together under one management system. More information on SAVI testbed infrastructure management and Janus is provided in [34].

[Figure 2.2 labels: Smart Edge; Application and Service Provider; OpenStack; Keystone; Glance-reg; Glance-API; Nova; Cinder; Swift; Quantum; Driver-1, Driver-2, ..., Driver-N; Agents; Heterogeneous Resources; Sliceable OpenFlow Network; SDI Manager Janus; Configuration Manager; OpenFlow Controller (Ryu); FlowVisor; User OpenFlow Controller.]

Figure 2.2: The SAVI testbed Smart Edge node.

Of particular interest to this thesis in Figure 2.2 is the Nova component of OpenStack, which is the part that allocates resources. The standard Nova only supports processor virtualization, where Virtual Machines (VMs) are booted on top of hypervisors that abstract away the physical hardware. The vision of the Smart Edge, however, incorporates Heterogeneous Resources in addition to VMs. Thus Nova in the SAVI testbed is extended to enable it to manage these new resources.

2.3.2 Heterogeneous Resources in the SAVI Testbed

For OpenStack to manage different types of resources, they must all appear homogeneous in nature. To accomplish this, the SAVI testbed OpenStack uses a Driver-Agent system. A Driver for any resource implements required OpenStack management API methods, such as boot, reboot, start, stop and release. The Driver then communicates these OpenStack management commands to an Agent, which carries them out directly on the resource, via a hypervisor or otherwise. In this fashion, OpenStack can manage all resources through the same interface. Figure 2.3 shows a diagram of the Driver-Agent system. Essentially, the Agent performs resource-specific management, while the Driver facilitates resource-agnostic management for OpenStack. The method of communication between the Agent and the Resource, and between the Driver and the Agent, is entirely resource-dependent.

[Figure 2.3 labels: OpenStack Nova; Common API; Driver; Communication; Agent 1 ... Agent N; Resources; Resource-Agnostic Management; Resource-Specific Management.]

Figure 2.3: The Driver-Agent abstraction used in the SAVI testbed OpenStack system.
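The split between resource-agnostic and resource-specific management can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Agent`, `InMemoryAgent` and `Driver`, the `handle` method, and the state strings are hypothetical and are not the SAVI or OpenStack code; only the five management method names come from the text above.

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """Resource-specific side: runs next to the resource and acts on it directly."""

    @abstractmethod
    def handle(self, command, resource_id, **params):
        ...


class InMemoryAgent(Agent):
    """Hypothetical Agent that only records state instead of driving real hardware."""

    _NEXT_STATE = {"boot": "running", "reboot": "running",
                   "start": "running", "stop": "stopped"}

    def __init__(self):
        self.states = {}

    def handle(self, command, resource_id, **params):
        if command == "release":
            self.states.pop(resource_id, None)
            return "released"
        if command not in self._NEXT_STATE:
            raise ValueError(f"unknown command: {command}")
        self.states[resource_id] = self._NEXT_STATE[command]
        return self.states[resource_id]


class Driver:
    """Resource-agnostic side: the same management methods for every resource type."""

    def __init__(self, agent):
        # The Driver-to-Agent transport is resource-dependent; a direct call is assumed here.
        self.agent = agent

    def boot(self, resource_id, image=None):
        return self.agent.handle("boot", resource_id, image=image)

    def reboot(self, resource_id):
        return self.agent.handle("reboot", resource_id)

    def start(self, resource_id):
        return self.agent.handle("start", resource_id)

    def stop(self, resource_id):
        return self.agent.handle("stop", resource_id)

    def release(self, resource_id):
        return self.agent.handle("release", resource_id)
```

Supporting a new resource type then means writing a new `Agent` subclass, while the calling side keeps using the same five `Driver` methods.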

If a user desires to allocate a resource, they need to be able to specify what resource type they want. The SAVI testbed extends the OpenStack notion of resource flavor to enable this. Usually, resource flavor refers to the number of virtual processors and amount of RAM to allocate to a VM. The SAVI testbed extends the definition of flavor to also include resource type. The SAVI testbed currently has several of these additional resource types, including GPUs, bare-metal servers, and reconfigurable hardware. Although reconfigurable hardware is included in SAVI (and its precursor VANI [35]), it is still relatively non-virtualized – simply FPGA cards in bare-metal servers managed by OpenStack.

For OpenStack to be aware of these resources, resource references must be placed in its database – one for each allocatable resource. This is done using the nova-manage tool. The resource database entry includes the address of the Agent that provides the resource, a type name that can be associated with a flavor, and the number of physical network interfaces the resource has. A flavor is created for each unique resource type.
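As a rough illustration of how a flavor that carries a resource type can be matched against such database entries, consider the sketch below. The field names, the `find_free` helper and the example addresses are assumptions made for this example; they are not the actual OpenStack schema or the nova-manage interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    name: str
    resource_type: str   # SAVI-style extension: which kind of resource this flavor maps to
    vcpus: int = 0       # only meaningful for VM flavors
    ram_mb: int = 0


@dataclass
class ResourceEntry:
    agent_address: str   # address of the Agent that provides the resource
    type_name: str       # associated with a flavor through its resource_type
    num_interfaces: int  # physical network interfaces on the resource
    in_use: bool = False


def find_free(entries, flavor):
    """Return the first free entry whose type matches the flavor, or None."""
    for entry in entries:
        if entry.type_name == flavor.resource_type and not entry.in_use:
            return entry
    return None


# Hypothetical database contents: two FPGA resources (one busy) and one GPU server.
database = [
    ResourceEntry("10.0.0.5", "fpga", 4, in_use=True),
    ResourceEntry("10.0.0.6", "fpga", 4),
    ResourceEntry("10.0.0.7", "gpu", 1),
]
fpga_flavor = Flavor("fpga.full", "fpga")
```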

2.4 FPGA Virtualization

This thesis focuses on exploring methods of virtualizing FPGAs and managing the device or portions thereof using OpenStack in the SAVI testbed. A number of prior works have examined “virtualization” of FPGAs in different contexts, described in the following subsection.

2.4.1 Related Work

Hardware virtualization, especially that pertaining to FPGAs, has been explored for some time. Initially realized through time multiplexing hardware [36], most hardware virtualization schemes now generally use run-time reconfiguration of the FPGA. This can be full reconfiguration of the entire device, but usually refers to dynamic Partial Reconfiguration (PR) of a portion of the FPGA.

In terms of network applications, which is a theme in this thesis, there has been work examining the use of partial reconfiguration to virtualize forwarding data planes in routers [37], although this is a very specific case and does not involve user-designed custom hardware. Recent works involving virtualized FPGAs for custom user hardware virtually increase the number of available FPGA partially reconfigurable resources [38] or virtualize non-PR coprocessors [39], to maintain parity with the number of microprocessors in a high performance computing environment. Others use the partial reconfiguration technique to make reconfigurable hardware sharable by multiple software processes. This generally involves some sort of virtualization on the level of the operating system in addition to the FPGA or gate-level virtualization done using PR. Some works investigate operating system and scheduler design specifically to manage reconfigurable hardware tasks [40, 41]. On a lower level, Huang et al. use a hot plugin technique to provide access to PR-based accelerators via a unified Linux kernel module, allowing multiple processes to efficiently share different accelerators [42]. Pham et al. propose a microkernel hypervisor for new FPGA/CPU hybrid devices, which facilitates access to either a CGRA-like Intermediate Fabric or a regular PR region running user accelerators [43].

What all these schemes and others like them have in common is that they view the reconfigurable accelerators as rather short-lived entities executing hardware “tasks”, which supplement software tasks running on a conventional processor. Thus they focus heavily on reconfiguration times and concurrent access to the same PR region for multiple processes, as well as high bandwidth between the FPGA and CPU.

The context of FPGA virtualization in this thesis is markedly different. This thesis does not assume the virtualized accelerators to be closely coupled with CPUs or software processes; rather, the accelerators are seen as being a major or supplemental component of massive, distributed, cloud-based infrastructures. Most virtualization techniques like the ones mentioned above cannot readily be applied to IaaS clouds and VMs because the end user does not have any sort of access to the underlying physical hardware. It may be possible for a cloud provider to make so-called hardware tasks available to VMs and thus end users, but the users would likely be unable to define their own hardware because of the low-level access it would still require, which somewhat defeats the purpose.

envisions streaming, packet processing, and network centric applications as well – all things that the user of a virtualized datacenter may need. Because of this, the work presented here focuses on providing virtualized, in-network hardware resources that are analogous in the “resource” sense to VMs.

Chapter 3 Architecture for FPGA Virtualization

The general approach for virtualization is modelled after that for virtual machines. To do this, it is important to understand how OpenStack manages resources. This chapter will briefly examine how OpenStack operates and manages heterogeneous resources in the SAVI Testbed. Then, the architecture of the system enabling FPGA virtualization is presented.

3.1 OpenStack Resource Management

Figure 3.1 shows a simplified diagram of resource management in OpenStack. The OpenStack Controller runs on a Commodity Server inside the SAVI Testbed, and provides an API (specifically Nova) to allow a user to request resources. In general, VM resources in the cloud system are booted on top of hypervisors on machines physically separate from the Controller. OpenStack maintains a database of all resources in the system, both in use and free for allocation.

[Figure 3.1 labels: User; API (Nova); SAVI TB OpenStack Controller (Commodity Server); Resource Database; *-Drivers; Resource Server; Agent; Hypervisor; VM Resources; SAVI TB Heterogeneous Resources; Physical Server Resource; GPU Server Resource.]

Figure 3.1: A simplified view of resource management in OpenStack/SAVI Testbed.

When a resource request comes in, OpenStack finds available resources in the Resource Database, and finds which physical machine they are located on. The Controller communicates with the Resource Server via a separate process running beside the Hypervisor, called an Agent. The Agent is a piece of software that interprets commands from the main OpenStack Controller. As described briefly in Chapter 2, the Agent is part of a Driver-Agent abstraction – a Driver integrates with the OpenStack compute controller (Nova), implementing resource control API functions. Through the Driver, OpenStack requests the resources from the Agent, which in turn instructs the Hypervisor to boot a VM with the operating system image and parameters specified by the User (sent to the Agent by the Controller). Networking information is also sent by the Controller, which is used by the Agent and Hypervisor to set up network access for the VM. A reference, usually in the form of an IP address, is returned to the User so that they can connect to their resource and run whatever application or system they want on top of it. For more details on OpenStack, the reader is referred to [24].
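The request sequence described above (database lookup, boot request to the Agent on the chosen machine, reference returned to the user) can be walked through with a toy model. Everything here is a simplifying assumption for illustration: the class names, the slot-count "database", and the folding of the Driver into a direct method call are not how OpenStack is implemented.

```python
class HypervisorAgent:
    """Stand-in for an Agent running beside a hypervisor on one Resource Server."""

    def __init__(self):
        self.instances = {}

    def boot(self, instance_id, image, network):
        # A real Agent would instruct the hypervisor here; this sketch only records the request.
        self.instances[instance_id] = {"image": image, "network": network}
        return network["ip"]


class Controller:
    """Toy controller: resource database lookup, then a boot request to the right Agent."""

    def __init__(self, agents, free_slots):
        self.agents = agents                # host name -> Agent (Driver folded in for brevity)
        self.free_slots = dict(free_slots)  # host name -> number of free VM slots
        self.next_id = 0

    def request_vm(self, image, network):
        for host, free in self.free_slots.items():
            if free > 0:
                self.free_slots[host] = free - 1
                self.next_id += 1
                # The returned reference (an IP address) is what the user connects to.
                return self.agents[host].boot(self.next_id, image, network)
        raise RuntimeError("no free resources")
```

A request made when every host reports zero free slots raises an error in this sketch; how a real scheduler reacts in that case is outside what the text describes.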

Figure 3.1 also shows that the SAVI Testbed OpenStack Controller manages heterogeneous resources through similar mechanisms. A custom driver communicates with Agents managing bare-metal servers, with some being regular bare-metal servers, and others containing GPUs or reconfigurable hardware.

The current state of reconfigurable hardware in the SAVI Testbed is relatively non-virtualized – bare-metal servers with PCIe FPGA cards and several BEECube systems. They are non-virtualized because the resource provided is a fully physical resource, not sharable between Users and thus not very scalable or flexible. The current reconfigurable resources in the SAVI Testbed are described briefly below.

3.1.1 SAVI Testbed FPGA Resources

This subsection briefly describes the current, non-virtualized FPGA resources in the SAVI Testbed.

BEE2 Boards

The SAVI Testbed has a number of BEE2 systems [44]. The BEE2 is equipped with five Xilinx FPGAs, with one used to control the others. In the Testbed, an Agent runs on an embedded system on the control FPGA, and manages the other FPGAs as resources that can be allocated. Each FPGA resource has four 10G-capable CX4 interfaces that connect to the Testbed SDN, allowing the user to send and receive data from their hardware on the FPGA.

Since the user simply gets the entire device as a resource, they are responsible for designing and compiling their hardware using vendor tools, ensuring that their hardware ports match the correct pin locations on the BEE2, and ensuring that the hardware will function correctly. Once they generate a bitstream file for programming the FPGA, it is uploaded to OpenStack as an image, and is later loaded onto the FPGA by the Agent.

Ordinarily, an “image” refers only to an Operating System (OS) image; however, OpenStack allows any file type to be uploaded as an image. Therefore, for a BEE2 FPGA resource, the image will be a bitstream generated by the FPGA tools. The Agent receives this image from the OpenStack Controller via the driver, and simply configures it onto an unused FPGA. OpenStack sees the FPGA as any other resource thanks to the Driver-Agent abstraction, and the user can now make use of custom hardware acceleration in the SAVI Testbed.

PCIe-Based FPGA Cards

To increase the range of different FPGA applications available to researchers, it is useful to have FPGAs closely coupled to processors so that the reconfigurable hardware can accelerate compute-intensive portions of software. The SAVI Testbed provides several PCI-Express-based FPGA boards connected to physical servers: the NetFPGA, the NetFPGA10G [45] and the DE5Net [46]. The boards have varying FPGA device sizes and on-board memory, but have in common four network interfaces that are connected to the Testbed SDN. The NetFPGA has four 1G Ethernet ports, while the NetFPGA10G and DE5Net have four 10G Ethernet ports. A researcher can now design custom hardware that can accelerate software tasks, provide line-rate packet processing, or a combination of both.

In addition to these boards, the Testbed also contains MiniBEE [47] resources. The MiniBEE contains a conventional processor and an on-board FPGA connected through PCIe. It also has 10G network interfaces, a large amount of memory and an expansion port for additional FPGA peripherals.

Since the PCIe boards are required to be mounted inside physical servers, the SAVI Testbed provides the server itself with the FPGA card attached as a resource. In the case of the MiniBEE, the entire system is also offered as a resource.

3.1.2 Requirements for FPGA Virtualization in OpenStack

Using Virtual Machines as a model for full FPGA virtualization, it is clear there are several required components: An agent to provide the FPGA (or pieces thereof) as a resource, and a driver to integrate into OpenStack so that OpenStack can communicate with the Agent. The Agent is responsible for managing the actual resource provided to the user, in this case an FPGA or portion thereof, and therefore must be capable of receiving an FPGA programming file (hereafter referred to as a “bitstream”) and configuring or reconfiguring the device. It must also track which FPGAs or FPGA portions have user hardware running in them, and which hardware belongs to which user.
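The bookkeeping required of such an Agent can be sketched as follows. This is a hypothetical illustration, not the Agent built in this thesis: the class `FpgaAgent`, its methods, and the `configure` callback (which stands in for the device-specific programming step) are all assumptions made for the example.

```python
class FpgaAgent:
    """Hypothetical Agent tracking which FPGA portions hold which user's hardware."""

    def __init__(self, num_regions, configure=None):
        self.regions = [None] * num_regions  # None means the region is free
        # `configure` stands in for the device-specific programming step.
        self.configure = configure or (lambda region, bitstream: None)

    def boot(self, user, bitstream):
        """Program the first free region with the user's bitstream and record ownership."""
        for index, owner in enumerate(self.regions):
            if owner is None:
                self.configure(index, bitstream)
                self.regions[index] = user
                return index
        raise RuntimeError("no free region")

    def release(self, user, index):
        """Free a region, but only on behalf of the user who owns it."""
        if self.regions[index] != user:
            raise PermissionError("region not owned by this user")
        self.regions[index] = None
```

Keeping the ownership table inside the Agent means the cloud controller never needs to know how the device is partitioned; it only sees allocatable resources.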

Additionally, if full virtualization is to be achieved, the physical FPGA device must be abstracted and sharable between different users. This will require a base hardware architecture to “virtualize” the device, somewhat similar to a hypervisor. The Agent must be aware of this virtualization layer to manage the resources as well.

The following sections will describe the design of a system that meets the aforementioned requirements, using OpenStack in the SAVI Testbed as the cloud computing platform. The base hardware architecture virtualizing the device will be described, and then the Agent that provides the virtualized hardware to OpenStack.

3.2 Hardware Architecture

Though the current FPGA resources in the SAVI Testbed are managed by OpenStack, they still leave much to be desired in terms of commercial systems and user-friendliness. The resources are still relatively non-virtualized – a single physical device is allocated to one user, whereas in a fully virtualized system, one physical device should be sharable among different users simultaneously. There is more motivation for this when considering that one user may not make full use of an entire FPGA, wasting some reconfigurable fabric that could go to another user. A full FPGA can also be more difficult to design and program, especially when integrating complex IP (such as memory controllers), and without physical access to the device.

The architecture presented in this thesis seeks to resolve these problems through full hardware virtualization.

3.2.1 Fully Virtualized Hardware

Virtualization in a cloud computing context has several characteristics that are usually presented in terms of Virtual Machines:

• Physical Abstraction – The physical device itself is abstracted and the user is not aware of the underlying hardware. For example, a VM may be running via a hypervisor running on an Intel Xeon processor, however the user is only aware of how many Virtual CPUs they have. They are unaware of the real hardware.

• Sharing – A single physical device provides one or more virtual instances to one or more users. Such devices can also be referred to as multi-tenant devices.

• Illusion of Infinity – The actual number and physical location of resources is also abstracted, and from the user’s view there exists a seemingly infinite pool of resources.

The objective of the work presented in this thesis is to enable fully virtualized FPGA hardware by designing an architecture and system that has the above characteristics (or characteristics that are analogous). The physical device should be abstracted – a user should be able to specify a hardware design, in HDL for example, and rely on the system to run their design in the cloud. They should not have to worry about what specific device their hardware must run on, nor about compiling for different devices. They should also not be aware of other physical aspects of the system – such as the physical location of resources or the number of available resources (i.e. the illusion of infinite resources in the cloud should be maintained.)

FPGA size and density have grown to the point where it is feasible for the device to be virtualized and shared between users, with enough fabric left over for each user to run non-trivial hardware designs. This also has the benefit of improving the usage efficiency and cost-effectiveness of a device, since many useful hardware designs do not need as much logic as the entire device provides. Full hardware virtualization would allow the cloud provider to put the unused fabric to work for other users.

Another issue fully virtualized hardware would solve is that of security. Giving a user full control over an entire FPGA directly connected to the network may be risky for a cloud provider. It would allow a nefarious user the ability to inject malicious data directly into the provider datacenter, at extremely high rates (10Gb/s or more). A hypervisor, in the case of a regular VM, acts as a buffer between the user’s guest OS and the provider hardware, allowing the provider to set up security and police network traffic before it gets onto the internal network. The hardware virtualization layer is therefore designed to allow the provider to police the data going in and coming out of the user hardware.

The general approach for virtualization of the hardware is based on Partial Reconfiguration (PR) of the FPGA. This technique of reconfiguring specific portions of an FPGA while the rest remains running can be used to effectively split the device into several regions that can be offered individually as resources to cloud users. To familiarize the reader, the basics of Partial Reconfiguration will be reviewed in the following section.

3.2.2 FPGA Partial Reconfiguration

Partial Reconfiguration is a capability of some FPGAs where portions of the device can be reconfigured independently, without affecting other circuits running on the device. Physical portions of the device must be specified to be a Partially Reconfigurable Region (PRR), and specific hardware modules of the overall design must be mapped to one of these PRRs. Multiple modules can be compiled for one PRR, however only one can be configured at run time. These modules are called Partially Reconfigurable Modules (PRMs). Generally, the logic surrounding the PRRs is fixed, and is referred to as the static logic. Major FPGA vendors support partial reconfiguration [48, 49, 50].

Figure 3.2 depicts a partially reconfigurable FPGA system. There is one PRR, and three PRMs (PRM A, B and C). Each PRM contains different hardware implementing different functionality, and each PRM can be dynamically configured into the PRR at run-time while the Static Logic remains running.

PR introduces complexities into the hardware compilation process. The interface from the static logic to a PRR, called the PR Boundary, must be dealt with carefully by the CAD compile process. The static logic must be compiled once, since it does not change, along with one of the PRMs. After placement and routing, the physical wires crossing the PR boundary are set permanently since they connect to the static logic, shown in Figure 3.2 as Static Connection Points. Further PRMs compiled with the static logic must have their connections routed to the same physical locations, so that when they are partially reconfigured, their connections actually connect to the running static logic. This is also shown in Figure 3.2, where any PR Boundary crossing signal is routed to the same location in each PRM. This is usually accomplished by locking the placement of the logic cells whose wires cross the boundary (called anchor LUTs) after compiling the static logic. Obviously, every PRM for a given PRR must have the same logical top-level ports, whether or not they use them all.

Other considerations must be made by designers using PR. During reconfiguration, outputs of a PRR may be in flux and have unknown values. The static logic should have a method of freezing these outputs or ignoring them while reconfiguration takes place. Timing constraints can also be harder to meet in PR systems, since the CAD tools are unable to perform any logic optimization across the PR boundary. Xilinx Inc. suggests registering signals both before and after the PR boundary to improve timing performance [48]. Lastly, the designer should ensure that a freshly reconfigured PRM is fully reset to a known state.

Figure 3.2: FPGA Partial Reconfiguration. (Diagram labels: FPGA; Static Logic; PRR; PR Boundary; Static Connection Point; PRM A; PRM B; PRM C.)

3.2.3 Virtualization via PR

In a cloud computing context, the static logic surrounding the Partially Reconfigurable Regions implements hypervisor functions – providing a buffer under control of the cloud provider in between the network and the user-defined hardware in the PRRs. Just as in a VM hypervisor, this will allow the cloud provider to implement some measure of security, and possibly other required management functions. The static logic also has several other functions that will become apparent as the full design is described below.

3.2.4 Static Logic Design

As mentioned previously, partial reconfiguration is used to split a single FPGA into several reconfigurable regions, each of which is managed as a single Virtualized FPGA Resource (VFR). In effect, this virtualizes the FPGA and makes it a multi-tenant device, although still requiring the external control of the Agent. A user can now allocate one of these VFRs and have their own custom designed hardware placed within it. The static logic surrounding the VFRs is still under control of the cloud provider, and must accomplish several functions. The method of data transfer between a user and their VFR is over the network, and therefore the static logic must facilitate forwarding of packets to the correct VFR. To do this, the static logic system must track the network information (i.e. MAC addresses) of each VFR as provided by the OpenStack Controller. The static logic is also designed in a way that enables it, and thus the cloud provider, to police the interfaces to the VFRs, to maintain some basic network security, such as prevention of sniffing traffic and spoofing addresses. The static logic contains interface hardware for 10G Ethernet ports, memory controllers, all chip level I/O, and a method of communicating with the Agent. The following paragraphs and subsections describe the design choices made for this thesis to accomplish these functions.

Figure 3.3 shows a block diagram of the on-FPGA portion of the system. A Soft Processor (that is, a microprocessor implemented inside the FPGA fabric) communicates with the Agent that runs on the Host machine. The Soft Processor is attached to a Bus that allows it to communicate with and control the different components of the system. A bus is a two-way communication system with two actors: Masters and Slaves. Masters can initiate read and write transactions by addressing the Slave they wish to communicate with, while Slaves can only respond to reads or writes. The Soft Processor is a Bus Master. The DRAM Controller is a Bus Slave that facilitates access to Off-Chip DRAM. The MAC Memory-mapped (Memmap) Registers are also Slaves that allow the Soft Processor to control the Input Arbiter and Output Queues (described in the following subsection). The VFRs are wrapped in Bus Masters, which allows them to access the DRAM Controller slave. Packet streams from 10G interfaces pass through the Input Arbiter and the VFR Bus Master wrappers and enter the VFRs. Output streams exit the VFR and connect to the Output Queues and subsequently the egress interfaces. The Agent can reprogram the entire chip or individual VFRs through an external Programmer that operates over JTAG. The major subcomponents shown in Figure 3.3 are described in the following subsections.

Figure 3.3: System view of the on-FPGA portion of the virtualization hardware. (Diagram labels: Virtex 5 VTX240T; Agent (Host); Programmer (JTAG); Soft Processor; 32-bit Bus; MAC Memmap Registers; Input Arbiter; VFR Wrappers 1 to N, each containing a VFR; Output Queues; DRAM Controller; Off-Chip DRAM, 128 MB; Stream In, 256b wide, from 10GE (160 MHz); Stream Out to 10GE.)

Data Transfer and I/O

A Streaming Interface facilitates packet transfer into the system from 10G Ethernet ports. Streaming Interfaces are one-way communication channels. They consist at a minimum of four signals:

1. A variable bit width Data signal.

2. A single bit Valid signal.

3. A single bit Ready signal.

4. A Clock signal.

The actual data of the packet is transferred in chunks through the Data field, where each chunk is the bit width of the Data field. These chunks are referred to as flits. At the positive edge of the Clock signal (that is, a logic low to logic high transition), one flit is transferred if and only if both the Valid and the Ready signals are logic high. The Valid signal is asserted by the sender along with the flit in the Data field, and the Ready signal is asserted by the receiver to indicate it is ready to receive data. The Valid and Ready signals are also referred to as handshake signals. One flit can be transferred over the Data signal every clock cycle.
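The handshake rule described above can be illustrated with a small cycle-by-cycle software model. This is only a sketch: the function name `simulate_stream` and the list-based representation of the Valid and Ready signals are assumptions made for illustration and do not correspond to any hardware or software in the prototype.

```python
def simulate_stream(flits, valid, ready):
    """Model a streaming interface one positive clock edge at a time.

    flits -- data words the sender wants to transfer, in order
    valid -- per-cycle Valid values driven by the sender (True/False)
    ready -- per-cycle Ready values driven by the receiver (True/False)
    Returns the list of flits the receiver accepted.
    """
    received = []
    pending = list(flits)
    for v, r in zip(valid, ready):
        # The sender can only present a flit if it still has one to send.
        sender_valid = v and bool(pending)
        # A flit moves on this clock edge only if Valid AND Ready are high.
        if sender_valid and r:
            received.append(pending.pop(0))
    return received
```

In this model a cycle in which either signal is low simply transfers nothing, so the sender's flit stays pending until both signals are high on the same edge.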

The Input Arbiter block, shown in Figure 3.4, is responsible for directing incoming packets to the correct VFR. The Input Arbiter contains a Content Addressable Memory (CAM) that the arbiter uses to match a packet’s Destination MAC address to a specific VFR. An incoming packet is stalled for one clock cycle while the CAM looks up the destination VFR. In the case that there is no matching VFR in the CAM, the packet is simply dropped. This has the benefit of preventing VFRs from inadvertently (or intentionally) sniffing Ethernet traffic that does not belong to them, but the drawback of being unable to receive broadcast packets. This could be addressed by designing a more complex switching fabric within the Input Arbiter.
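The forwarding decision made by the Input Arbiter can be sketched in software by modelling the CAM as a dictionary keyed on the destination MAC address. The class and method names below are hypothetical, and a dictionary lookup only imitates the behaviour of the hardware CAM; it says nothing about its timing.

```python
class InputArbiterModel:
    """Software model of the CAM-based forwarding decision (illustrative)."""

    def __init__(self):
        # Maps a 6-byte destination MAC address to a VFR region code.
        self.cam = {}

    def program(self, mac, region_code):
        """Store a MAC address and the region code it should forward to."""
        if len(mac) != 6:
            raise ValueError("MAC address must be 6 bytes")
        self.cam[bytes(mac)] = region_code

    def route(self, frame):
        """Return the region code for an Ethernet frame, or None to drop it."""
        dest_mac = bytes(frame[0:6])  # destination MAC is the first 6 bytes
        return self.cam.get(dest_mac)
```

Because a frame addressed to the broadcast MAC address has no entry in the table, `route` returns `None` for it, which mirrors the dropped-broadcast limitation noted above.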

The CAM must be programmed with the VFR MAC addresses (provided by OpenStack) before any packets can be received. MAC addresses for new VFRs are programmed into the CAM by a soft processor via several memory mapped registers. The software running on the soft processor receives the MAC address and corresponding VFR region code over a UART link from the Agent running on the host. The software running on the processor and the communication protocol with the Agent will be described in more detail in Section 3.3.

The output queues operate similarly, except this block simply tracks each VFR’s MAC in a register, and prevents spoofing by forcing an outgoing packet’s source MAC address to be the VFR’s MAC address. The output queue MAC addresses are also updated via the same memory mapped registers as the input arbiter.
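The anti-spoofing behaviour of the output queues amounts to overwriting the source MAC field of every outgoing frame. A minimal software sketch follows, assuming a standard Ethernet header layout (destination MAC in bytes 0 to 5, source MAC in bytes 6 to 11); the function name `force_source_mac` is hypothetical.

```python
def force_source_mac(frame, vfr_mac):
    """Return a copy of an Ethernet frame with its source MAC overwritten.

    Bytes 0-5 hold the destination MAC and bytes 6-11 the source MAC,
    so only bytes 6-11 are replaced with the VFR's registered address.
    """
    if len(vfr_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    return bytes(frame[:6]) + bytes(vfr_mac) + bytes(frame[12:])
```

Whatever source address the user hardware writes, the frame leaves with the address registered for that VFR, so the rest of the frame is left untouched.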

Virtualized Accelerator Wrappers

The VFRs are wrapped inside Bus Masters, as can be seen in Figure 3.3. These are labelled 1 to N. Figure 3.5 shows a closer view of this wrapper design, which must accomplish two things – facilitate safe partial reconfiguration of a VFR, and provide the VFR access to off-chip DRAM. Safe reconfiguration is of paramount importance, since the cloud provider should seek to guarantee that user hardware will be configured correctly. The system must ensure that no information or packet is being transferred across the PR boundary during reconfiguration. A transfer in progress during reconfiguration can result in lost data, or in a newly reconfigured VFR receiving data starting in the middle of a packet, which could cause errors in the user hardware.

Figure 3.4: Virtualization hardware input arbiter block. (Diagram labels: Input Arbiter; Stream In; 48 Bit CAM; VFR Region Code Out; To VFRs; MAC Memmap Registers on the Bus, driving write, MAC Addr and VFR Region Code.)

Figure 3.5: The VFR wrapper design. (Diagram labels: VFR (User IP); Stream In; Stream Out; Bus; Memmap Register; PR_INTERFACE_ENABLE; ACK; PR_RESET; Memory Operation Queues.)

The VFR reconfiguration process is handled as follows: When a request comes to the Agent to configure a new VFR, the Agent sends a command to the soft processor to freeze the interfaces of the selected region. The soft processor will de-assert the PR_INTERFACE_ENABLE signal by writing to the register, which causes the wrapper hardware to set all streaming interface handshake signals low after any current transfer finishes. This is accomplished by using a register for each Stream Interface that records whether or not the stream is in the middle of a transaction. When both transaction registers are logic 0, the hardware sends an ACK to the processor through another memory-mapped register that the processor is polling, and the processor asserts the PR_RESET signal and notifies the Agent that it is now safe to reconfigure that region. The VFR is held in reset, and after the new user hardware is configured via an external JTAG connection and the MAC address programmed into the input arbiter and output queue, the Agent instructs the soft processor to release PR_RESET and enable the interfaces again. This method also ensures that new user hardware is fully reset before interfaces are enabled.
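The freeze and acknowledge sequence described above can be sketched as a small software model. The class `VfrWrapperModel`, the function `freeze_for_reconfiguration` and the `max_polls` limit are all hypothetical; the real logic is implemented in hardware registers and in the embedded software, and the source does not state a polling limit.

```python
class VfrWrapperModel:
    """Illustrative model of the freeze/acknowledge logic in a VFR wrapper."""

    def __init__(self):
        self.interface_enable = True   # models PR_INTERFACE_ENABLE
        self.pr_reset = False          # models PR_RESET
        # One flag per stream interface: True while a packet is mid-transfer.
        self.in_transaction = {"stream_in": False, "stream_out": False}

    def start_packet(self, stream):
        """A new packet may only start while the interfaces are enabled."""
        if not self.interface_enable:
            return False
        self.in_transaction[stream] = True
        return True

    def end_packet(self, stream):
        self.in_transaction[stream] = False

    @property
    def ack(self):
        """ACK is raised once frozen and no transfer is in flight."""
        return not self.interface_enable and not any(self.in_transaction.values())


def freeze_for_reconfiguration(wrapper, max_polls=1000):
    """Soft-processor side: freeze, poll ACK, then hold the region in reset."""
    wrapper.interface_enable = False
    for _ in range(max_polls):
        if wrapper.ack:
            wrapper.pr_reset = True
            return True
    return False  # a transfer never completed within the polling budget
```

The model captures the two properties the text relies on: no new packet can start once the interfaces are disabled, and reset is only asserted after every in-flight transfer has finished.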

The wrapper also facilitates access to low-latency off-chip DRAM for a VFR. The VFR wrapper has read and write ports so that user hardware can insert memory operations into a queue that acts as a bus master. The queue would be implemented as a FIFO so that all writes added to the queue are done in order, and read data is also returned in order. The queue, which is part of the wrapper hardware under provider control, partitions the memory address space so that each VFR in the physical FPGA only sees a subset of the total DRAM – effectively giving each VFR a private off-chip memory, while making it impossible for any VFR to access another’s memory. This is done by dividing the address space and offsetting addresses by an appropriate amount before entry into the operation queue. For example, if the memory is 64 MB then the memory address space might be 0x03FFFFFF <-> 0x00000000 or a total number of address locations of 0x04000000. If there are four VFRs then this space is divided into four, and each VFR would have log₂(0x01000000) = 24 address bits, or a range of 0x00FFFFFF <-> 0x00000000 (16 MB each). When a VFR makes a read or write operation, the 24 bits are zero-extended to 32 bits (the system bus width) and offset an appropriate amount by adding a multiple of 0x01000000. The offset is done outside the VFR and inside the provider-controlled static logic, making the user hardware entirely unaware that memory beyond its partition exists. The queue is reset and cleared upon VFR reconfiguration via the PR_RESET signal.
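The address partitioning in the 64 MB, four-VFR example can be checked with a short calculation. The sketch below is illustrative only: the names are hypothetical, the address width is computed from the partition size rather than hard-coded, and the masking step is an assumption that stands in for the VFR simply not having any higher address bits.

```python
TOTAL_BYTES = 64 * 1024 * 1024      # 64 MB example from the text
NUM_VFRS = 4
PARTITION_BYTES = TOTAL_BYTES // NUM_VFRS           # 0x01000000 (16 MB)
LOCAL_ADDR_BITS = PARTITION_BYTES.bit_length() - 1  # log2 of the partition size


def translate(vfr_index, local_addr):
    """Map a VFR-local address to a physical DRAM address.

    The local address is masked to the partition width (assumption: models
    the VFR having only LOCAL_ADDR_BITS address lines), then offset by a
    multiple of the partition size selected by the VFR index.
    """
    if not 0 <= vfr_index < NUM_VFRS:
        raise ValueError("no such VFR")
    masked = local_addr & (PARTITION_BYTES - 1)
    return vfr_index * PARTITION_BYTES + masked
```

With these values the highest address reachable by the last VFR is `translate(3, 0x00FFFFFF)`, which equals 0x03FFFFFF, the top of the 64 MB space, and no local address can map into another VFR's partition.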

In the prototype system using the NetFPGA10G, the RLDRAM modules have a minimum read latency of three clock cycles, and a minimum write latency of four clock cycles, plus several cycles for the controller running on the FPGA. The module can burst read and write up to a length of eight. Maximum aggregate bandwidth is 38.4 Gb/s.

This memory system is admittedly rudimentary compared to other possibilities such as a multi-port memory controller, but it was chosen to keep the static logic simple and low-area.

3.3 Agent Design

This section describes the Agent – the piece of software that communicates with OpenStack, performing resource specific management. One Agent manages all the VFRs on one FPGA, although the design could easily be extended to manage VFRs on multiple FPGAs. This section will describe Agent requirements in general, but also focus on the Agent implemented in the SAVI Testbed prototype, which uses the NetFPGA10G card.

Generally, the Agent must implement the resource-specific management commands from OpenStack (issued via the Driver). At the very least, these must include boot (instantiate a new resource with the specified parameters) and delete (tear down a running resource).

The Agent and embedded software use a simple command-acknowledge protocol to communicate: the Agent will send a command string, and the embedded software will respond with an acknowledge, and an additional acknowledge for each data parameter sent as part of the command. If no acknowledge is received, the command is aborted and re-attempted. In the prototype hardware, the embedded software is run on a Xilinx Microblaze soft processor. The Agent and OpenStack Driver communicate over the Testbed network using a text-based protocol over TCP.
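The command-acknowledge exchange with retry can be sketched as follows. This is not the prototype's code: the `link` interface (`send` and `recv_ack`), the `FakeLink` test double and the `max_attempts` limit are assumptions introduced for illustration, since the source does not specify how many re-attempts are made or how the link is wrapped.

```python
def send_command(link, command, params=(), max_attempts=3):
    """Send a command and its parameters, expecting one ACK for each.

    `link` is any object with send(data) and recv_ack() -> bool methods
    (a hypothetical interface standing in for the UART link).
    Returns True on success, False if every attempt was aborted.
    """
    for _ in range(max_attempts):
        link.send(command)
        if not link.recv_ack():
            continue                      # no ACK: abort and re-attempt
        # all() stops at the first parameter that is not acknowledged.
        if all(_send_param(link, p) for p in params):
            return True
    return False


def _send_param(link, param):
    link.send(param)
    return link.recv_ack()


class FakeLink:
    """Test double that replays a scripted list of ACK results."""

    def __init__(self, acks):
        self.acks = list(acks)
        self.sent = []

    def send(self, data):
        self.sent.append(data)

    def recv_ack(self):
        return self.acks.pop(0) if self.acks else False
```

A missing acknowledge at any point restarts the whole command, so the embedded software never acts on a partially received parameter list.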

The following subsections describe how the Agent boots and deletes VFRs.

3.3.1 Booting

When a boot command is issued in OpenStack, several pieces of information are received by the Agent:

• UUID – a universally unique identifier for the resource.

• Network information – an IP and MAC address for the resource.

• Image – usually an OS image for VMs, but repurposed for VFRs.

The Agent uses the UUID to track VFRs – UUIDs in OpenStack are an absolute reference for any object or resource. The image is the most important piece of data received. It contains PR bitstreams corresponding to the user-designed hardware. How this is compiled and created will be explained in Section 3.5. The image contains one PR bitstream corresponding to each physical PRR on the FPGA, numbered so that the Agent can tell which one is for which region. This is explained in detail in Section 4.2.1. The Agent chooses the first available unconfigured PRR for the incoming VFR to be booted, and then begins the reconfiguration process – note that this requires the Agent to maintain the current state of the system, remembering what regions are currently configured and what users they belong to. A simple data structure could be used to store the state of each physical VFR and, if configured, the associated UUID.
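One possible form for such a data structure is a list indexed by region code. The sketch below is a minimal illustration under that assumption; the class name `RegionTable` and its methods are hypothetical and are not the prototype's implementation.

```python
class RegionTable:
    """Tracks which physical regions are configured and for which UUID."""

    def __init__(self, num_regions):
        # Index = region code; value = owning UUID, or None if unconfigured.
        self.owner = [None] * num_regions

    def allocate(self, uuid):
        """Claim the first unconfigured region; return its code or None."""
        for code, current in enumerate(self.owner):
            if current is None:
                self.owner[code] = uuid
                return code
        return None

    def release(self, uuid):
        """Free the region owned by `uuid`; return its code or None."""
        for code, current in enumerate(self.owner):
            if current is not None and current == uuid:
                self.owner[code] = None
                return code
        return None
```

`allocate` implements the "first available unconfigured PRR" policy, and `release` supports the lookup by UUID that a delete request needs.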

First, a “disable interfaces” command is sent to the embedded software. This command has a single parameter, the region code corresponding to the PRR about to be reconfigured. The soft processor freezes the packet stream interfaces as described in Section 3.2.4, and places the PRR (VFR) into reset. The Agent then reconfigures the PRR using an external reconfiguration tool, in the case of the prototype system, Xilinx iMPACT over JTAG. Then, the MAC address is programmed into the static logic’s input arbiter and output queues, which allow packets in and out of the newly configured VFR. The “load MAC address” command is sent to the embedded software followed by the six byte MAC address received from the OpenStack Controller. Once successful, the “enable interfaces” command is issued, and the embedded software releases the VFR from reset and enables the packet stream interfaces.

This method ensures that no packet is in the middle of transfer during a reconfiguration, avoiding the situation where a piece of user hardware might receive a fragmented packet. It also ensures that user hardware is fully reset before running.
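The ordering of the boot steps can be made explicit in a short sketch. Everything named here is hypothetical: the command strings, the two caller-supplied callables, and the choice to send the region code along with the MAC address are assumptions for illustration, not the prototype's actual interface.

```python
def boot_vfr(region_code, mac, bitstream, send_command, program_region):
    """Run the boot steps for one region in the order described above.

    send_command(name, params) -> bool      talks to the embedded software
    program_region(code, bitstream) -> bool wraps the external JTAG tool
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    # 1. Freeze the stream interfaces and hold the region in reset.
    if not send_command("disable_interfaces", [region_code]):
        return False
    # 2. Partially reconfigure the region with the user's bitstream.
    if not program_region(region_code, bitstream):
        return False
    # 3. Program the MAC into the input arbiter and output queues.
    if not send_command("load_mac_address", [region_code] + list(mac)):
        return False
    # 4. Release reset and re-enable the interfaces.
    return send_command("enable_interfaces", [region_code])
```

If any step fails the function returns early, which leaves the region frozen and in reset; in this sketch, cleaning up after a failed boot is left to the caller.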

3.3.2 Deleting

Deleting a VFR is similar to booting in terms of what the Agent must do, however the only piece of data sent is a UUID. The Agent finds the VFR with the corresponding UUID, and proceeds to reconfigure the PRR with a black box bitstream. This effectively removes any user hardware from the system. Again this requires that the Agent store the state of each VFR and associated UUIDs.

3.4 Booting VFRs in OpenStack

Several modifications were needed in the OpenStack Controller (Nova) to get a complete working system. Major modifications were already made by the SAVI Testbed team to enable management of bare-metal servers – specifically modifications that allowed the integration of custom Drivers and therefore custom Agents.

Recall that in Chapter 2 the notion of resource flavor was discussed. Normally, flavor refers to the number of virtual processors and amount of memory for a VM, but it can be repurposed to refer to more of a resource type. This is important because the Nova scheduler uses the flavor submitted by the user to select which Driver to use to contact the Agent. Each resource in the database references a flavor, so many single resources can fall under one flavor.

Figure 3.6 shows a diagram of the boot sequence for a VFR, which proceeds as follows: Upon receiving the Boot command from the User, the OpenStack Controller uses the specified flavor to choose a resource in the database. The database entry references the Agent associated with it through an IP and port number. Nova calls the Driver implementation of the “boot” function, and passes generated networking information, the Agent IP and port, UUID and any other required information. The Driver then communicates with the Agent (FPGA host), instructing it to boot a new resource, and in the case of VFRs, sending over networking information and the image containing partial bitstreams. The Agent selects the partial bitstream in the image matching a free region, and programs it along with the network information as in Section 3.3.1. Success is indicated to the Controller, which then passes a reference to the user.

3.5 Compiling Custom Hardware

Through the modifications and systems described in this chapter, custom hardware acceleration is now available to cloud computing users. However, there is still the question of how to compile hardware for the system. The VFRs have specific interfaces that any user hardware must match exactly – not only logically, but also physically, due to the nature of PR discussed previously. The static logic remains constantly running while user hardware is partially reconfigured as resources are booted and deleted. Therefore, any new user hardware must be compiled with the currently running static logic that is part of the cloud provider’s systems. A compile flow must be developed to allow end users to compile their custom hardware for use with the virtualization system.

Figure 3.6: VFR boot sequence. (Diagram labels: User issues Boot to the OpenStack Controller; Controller passes image, network info to the Agent (FPGA host); Agent selects PR bitstream (.bit) matching free region; disable VFR interfaces, configure region; program VFR MAC; enable VFR interfaces.)

First, the user hardware top-level ports must match those of the static logic PR boundary logically – a template HDL file is provided to end users inside which they can define their hardware. The template file contains a module definition whose top-level ports match the PR boundary in the static logic.

It is assumed at this point that the provider has already synthesized, placed and routed the static logic. Any user hardware has to be compiled with the placed and routed static logic, so that the physical wires crossing the PR boundary are placed in the right locations in the PRM. How this is done in practice depends entirely on the CAD tool flow provided by the FPGA vendor. For the prototype system implemented for this thesis, Xilinx PlanAhead is used to perform PR compilation.

system, but should generalize well to other vendors also. First, the user’s HDL (written in the provided template) is synthesized to a netlist. This netlist is added to a new compile “run” in Xilinx PlanAhead – the netlist is assigned to all physical VFRs present in the system (in the case of the prototype, all four of them). The run is also configured to use the already placed and routed static logic, and PlanAhead ensures that all boundary crossing wires are placed in the correct locations in the PRM. The compile run is initiated, and the tool maps, places and routes the user netlist in all VFRs (which are PRRs). One design is placed and routed in all VFRs because the Agent requires the flexibility to place the user’s design in any VFR in the system – PRMs can only be configured inside PRRs for which they are specifically compiled. Bitstream generation is then completed, creating one partial bitstream for each physical VFR. These partial bitstreams will be used by the Agent to configure the user hardware into a running system when a VFR is booted in OpenStack.

Recall that user hardware is sent to the Agent as an “image”. All partial bitstreams generated by the compile are added to a compressed archive, and this archive is uploaded as an image to OpenStack by the user with the image management tool called glance. When the Agent selects which physical VFR partition to use for the user’s hardware, it simply selects the partial bitstream corresponding to that PRR, configures it, and disregards the others.
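Packing the partial bitstreams into an archive and later picking out the one for a given region can be sketched with Python's standard `zipfile` module. The file naming scheme (`region_<code>.bit`) and both function names are assumptions for illustration; the source only states that the bitstreams are numbered by region.

```python
import io
import zipfile


def build_image(bitstreams):
    """Pack {region_code: bitstream_bytes} into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for code, data in sorted(bitstreams.items()):
            # Hypothetical naming scheme: one file per region code.
            archive.writestr(f"region_{code}.bit", data)
    return buf.getvalue()


def select_bitstream(image, region_code):
    """Return the partial bitstream compiled for `region_code`.

    Raises KeyError if the archive has no bitstream for that region.
    """
    with zipfile.ZipFile(io.BytesIO(image)) as archive:
        return archive.read(f"region_{region_code}.bit")
```

Only the selected member is read from the archive; the bitstreams for the other regions are ignored, matching the behaviour described above.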

Figure 3.7 shows the general compile flow, upload and boot procedure. For the prototype, such a compile system is realized using a script-based approach. The user places their netlist in a specific folder, and then executes a compile script that uses PlanAhead to perform the aforementioned compilation steps and bitstream packaging. The end result is the zip archive containing the partial bitstreams, ready for upload as an image. Steps inside the dashed lines of Figure 3.7 are part of the compile script.

[Figure 3.7. Diagram labels: user_ip.v; XST (ISE 13.4); user_ip.ngc; static.ncd, Static Logic from provider (placed and routed); PR Compile with Xilinx PlanAhead 13.4, comprising PR Compile Flow and Generate Bitstream; User IP PR Bitstreams (.bit) in zipped folder; Upload via glance; OpenStack Image.]

Chapter 4 SAVI Testbed Prototype

A prototype of the system described in Chapter 3 has been created using the SAVI testbed at the University of Toronto. The goal of the prototype system is to validate the system architecture and show its feasibility, as well as to evaluate and attempt to quantify the benefit of reconfigurable hardware resources in an Infrastructure-as-a-Service cloud system.

4.1 FPGA Hardware

The prototype is based on the SAVI testbed OpenStack cloud, with its ability to manage heterogeneous resources. The FPGA-based portion is implemented using the NetFPGA10G [45], available in the SAVI testbed as a non-virtualized resource in the form of a baremetal server with NetFPGA10G connected to a PCIe slot. The NetFPGA10G comes equipped with a Xilinx Virtex 5 VTX240T, four 10G Ethernet interfaces, and 128 MB of off-chip reduced latency DRAM. Although of an older generation than currently available, the Virtex 5 [51] is still large enough to realize a sufficiently non-trivial system that demonstrates the required functionality. The static logic architecture uses as a base the NetFPGA10G open source infrastructure [52]. Certain components of the infrastructure were modified to realize the architecture described in Chapter 3:

1. Input Arbiter: The open source infrastructure provides an input arbiter that simply forwards packets from all four interfaces to a single output stream in a round-robin fashion. This was modified to several output streams, one corresponding to each physical VFR in the system, and a CAM was inserted in the pipeline to realize correct forwarding to said VFRs. The CAM is created using Xilinx IP [53]. Additional top-level ports were also added to the Input Arbiter that connect to the write ports of the CAM. These top-level ports then connect to several memory-mapped registers (implemented as GPIO peripherals) connected to the microprocessor system bus. This allows a Xilinx Microblaze soft processor to program the MAC address for any specific VFR, by writing data to the GPIO registers. A single bit in one of the registers forms a Write signal to the CAM, which the software toggles to write the MAC and VFR region code into the CAM. The VFR datapath is clocked at a higher frequency than the embedded processor system, which may cause the CAM to be written several times in succession before the processor can write a ‘0’ to the write bit. This is generally not a problem however, since the same location will be written with valid data each time; it is only a minor inefficiency.

2. Output Queues: The NetFPGA10G infrastructure Output Queues were also modified to accommodate multiple incoming packet streams. Registers that contain the MAC addresses of each VFR were added, allowing the hardware to force source MAC address fields. Top-level ports were also added to program these registers via the same GPIO peripherals as the Input Arbiter.

All packet streams in the design are AXI Stream [54] streaming interfaces, with a width of 256 bits running at 160 MHz, equating to just over 40 Gb/s peak throughput. The stream widths are kept overprovisioned at 256 bits for the sake of simplicity. The prototype system contains four physical VFRs, each a Partially Reconfigurable Region (PRR). Each VFR region contains 11376 LUTs and 15 36K BRAMs. Four VFRs were chosen because this is a non-trivial number, and because it keeps each region large enough to implement meaningful hardware. It should also be noted that increasing the number of VFRs in the system also increases the number of required streaming interfaces, causing a large increase in required routing resources. Normally the placement and routing algorithms may handle this acceptably, but the problem is vastly more complicated due to the fact that PRRs are physically fixed, and the CAD tools are unable to do any optimization across the PR boundary. These constraints make it significantly more difficult for the CAD tools to meet timing, especially as the number of PRRs increases. This also contributed to the limiting of the number of VFRs.
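The peak throughput figure quoted for the packet streams follows directly from the stream width and clock frequency. The short calculation below reproduces it; comparing the result against the combined rate of the board's four 10G Ethernet interfaces is an illustrative assumption about what "overprovisioned" is measured against, not a statement from the source.

```python
STREAM_WIDTH_BITS = 256
STREAM_CLOCK_HZ = 160_000_000
NUM_10G_PORTS = 4

# One flit of STREAM_WIDTH_BITS can be transferred every clock cycle.
peak_bps = STREAM_WIDTH_BITS * STREAM_CLOCK_HZ   # bits per second
line_rate_bps = NUM_10G_PORTS * 10_000_000_000   # four 10G interfaces

print(peak_bps / 1e9)             # 40.96
print(peak_bps >= line_rate_bps)  # True
```

The result, 40.96 Gb/s, is the "just over 40 Gb/s" stated above.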

Resource utilization for the static logic is shown in Table 4.1. The counts in Table 4.1 exclude the resources inside the VFRs. The design with four VFRs successfully meets timing with the AXI streams running at 160 MHz and the soft processor system running at 100 MHz. The VFRs are connected to the AXI streams, so they also run at 160 MHz and make this clock available to user hardware. User hardware must meet the 160 MHz timing constraint to work properly, as no other clocks are provided.

Table 4.1: Resource Usage for System Static Hardware

Resource     Usage (Used / Device Total)
Flip-flop    29327 / 149760 (19%)
LUT          28711 / 149760 (19%)
36K BRAM     105 / 324 (32%)

4.2 Agent Software

The Agent for the prototype is implemented in Python, in keeping with the rest of the OpenStack project, and because Python provides high functionality with low coding effort. Although this can come at the cost of performance, the Agent is not required to be a high performance component.


its required tasks, and is designed in a way to make it FPGA hardware agnostic – the idea being that the same Agent software can be used for any FPGA device realizing the VFR system, with minimal modifications, provided the communication protocol to the soft processor is implemented correctly.

The software contains two global objects: a statusTable object, which holds information about the status of the hardware, and a serial object, which provides RS232 send and receive functionality for communication with the soft processor system. The serial object is implemented using the PySerial library.
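As a rough illustration of what such a serial object might look like, the sketch below wraps a PySerial-style port in a small send/receive class. It is not the Agent's actual code: the class and function names, the newline-terminated ASCII framing, the device path, the baud rate, and the "STATUS" message are all assumptions, since the text does not specify the protocol used with the soft processor. PySerial is imported lazily so that the in-memory demonstration runs without it.

```python
class SerialLink:
    """Thin send/receive wrapper around a PySerial-like port object."""

    def __init__(self, port):
        # Any object offering write(bytes) and readline() will do.
        self.port = port

    def send(self, message):
        # Assumed framing: one ASCII line per message.
        self.port.write(message.encode("ascii") + b"\n")

    def receive(self):
        return self.port.readline().decode("ascii").strip()


def open_rs232(device="/dev/ttyUSB0", baudrate=115200, timeout=1.0):
    """Open a real RS232 port via PySerial (device and baud rate are placeholders)."""
    import serial  # PySerial, imported only when a real port is needed
    return SerialLink(serial.Serial(device, baudrate=baudrate, timeout=timeout))


class LoopbackPort:
    """In-memory stand-in for a serial port, used only for this demo."""

    def __init__(self):
        self.lines = []

    def write(self, data):
        self.lines.append(data)
        return len(data)

    def readline(self):
        return self.lines.pop(0) if self.lines else b""


link = SerialLink(LoopbackPort())
link.send("STATUS")        # hypothetical message
reply = link.receive()     # the loopback simply echoes it back
```

Keeping the port object behind a narrow interface like this is one way to support the hardware-agnostic goal described above, since only the wrapper needs to know how a given board is reached.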

4.2.1 statusTable and Associated Objects

The statusTable global object stores the needed state of the entire FPGA hardware system. This includes:

1. The number of physical VFR regions in the managed FPGA hardware.

2. The FPGA system type. This refers to the FPGA device or part number, and the specific board it is mounted on. For example, the prototype system type is a NetFPGA10G.

3. A Python list of region objects. Each region object corresponds to one unique PRR (VFR) in the FPGA system.

4. A string containing information about the system type, useful for debugging.
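The four pieces of state listed above map naturally onto a small pair of Python classes. The following is a sketch under stated assumptions rather than the Agent's real definitions: the attribute names and the per-region fields (`configured`, `uuid`, `mac`) are hypothetical, chosen to match how regions are used later in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """One physical PRR (VFR). Field names are illustrative assumptions."""
    id: int
    configured: bool = False
    uuid: Optional[str] = None   # OpenStack UUID of the resource, once booted
    mac: Optional[str] = None    # MAC address assigned to the VFR


@dataclass
class StatusTable:
    """State of the managed FPGA hardware, following items 1 to 4 above."""
    num_regions: int                                      # item 1
    system_type: str                                      # item 2
    regions: List[Region] = field(default_factory=list)  # item 3
    system_info: str = ""                                 # item 4

    def __post_init__(self):
        # Create one Region per physical VFR if none were supplied.
        if not self.regions:
            self.regions = [Region(id=i) for i in range(self.num_regions)]


table = StatusTable(num_regions=4, system_type="NetFPGA10G",
                    system_info="prototype system")
```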

The statusTable object also provides several methods (functions) that allow the Agent to manage the FPGA hardware. The first is (in pseudocode):


The first argument to the function (bitPkg) points to a compressed file containing the image: a collection of bitstreams, one matching each PRR in the system, that are essentially hardware images supplied by the user and sent to the Agent by the OpenStack controller. macAddr is a string containing the MAC address of the VFR about to be booted. serial is a reference to the serial object, and uuid is a string containing the OpenStack-generated UUID of the resource. Algorithm 1 shows the basic operation of the function.

Algorithm 1: The statusTable.program() function

Data: bitPkg, regionList, uuid, MAC
Result: An unconfigured VFR in regionList is configured with the user hardware in bitPkg, and the MAC address is set to MAC

for region in regionList do
    if region is not configured then
        // region is free, find bitstream file matching this region
        pkg = uncompress(bitPkg)
        for file in pkg do
            if match(file, "* %d" % region.id) then
                region.configure(file, uuid)
                region.setMAC(MAC)
                return Success
            end
        end
    end
end
// a free region was not found, fail
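Algorithm 1 can be expressed as a short runnable Python sketch. This is an illustration under several assumptions, not the Agent's implementation: the package is assumed to be a zip archive, bitstreams are assumed to be named with a `_<region id>.bit` suffix, and the `Region` class with its `configure` and `set_mac` methods is a stand-in that only records what a real Agent would send to the FPGA over the serial link.

```python
import fnmatch
import io
import zipfile


class Region:
    """Minimal stand-in for a region object (names are assumptions)."""

    def __init__(self, region_id):
        self.id = region_id
        self.configured = False
        self.uuid = None
        self.mac = None
        self.bitstream = None

    def configure(self, file_name, uuid):
        # A real Agent would load the bitstream into the PRR here.
        self.bitstream = file_name
        self.uuid = uuid
        self.configured = True

    def set_mac(self, mac):
        # A real Agent would ask the soft processor to program the MAC.
        self.mac = mac


def program(region_list, bit_pkg, mac, uuid):
    """Configure the first free region that has a matching bitstream.

    Returns True on success and False if no free region could be used.
    """
    with zipfile.ZipFile(bit_pkg) as pkg:  # assumed package format: zip
        names = pkg.namelist()
    for region in region_list:
        if region.configured:
            continue
        # Region is free: find the bitstream file matching this region.
        for name in names:
            if fnmatch.fnmatch(name, "*_%d.bit" % region.id):  # assumed naming
                region.configure(name, uuid)
                region.set_mac(mac)
                return True
    return False


def make_demo_package(num_regions):
    """Build an in-memory zip with one placeholder bitstream per region."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        for i in range(num_regions):
            pkg.writestr("user_hw_%d.bit" % i, b"placeholder")
    buf.seek(0)
    return buf


regions = [Region(i) for i in range(4)]
regions[0].configured = True  # pretend region 0 is already in use
ok = program(regions, make_demo_package(4), "02:00:00:00:00:05", "demo-uuid")
```

In this run region 0 is skipped because it is already in use, so the hardware lands in region 1 using the bitstream built for that region. Shipping one bitstream per PRR is what lets the Agent place the user hardware in whichever region happens to be free.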
