
4.5 NAA Software Design

4.5.2 Objectives and Strategy

The main objective of the NAA software architecture is to provide a transparent mapping between the network-attached accelerator and cluster nodes by emulating a PCIe device. The implementation of the software stack needs to be transparent to upper software layers, including the accelerator device driver and runtime libraries, while maintaining the commodity aspect of the accelerators. Recapitulating the findings from section 4.3 and subsection 4.5.1, the NAA software environment needs to fulfill the following tasks:

Configuration Request Forwarding: PCIe configuration read and write requests need to be forwarded to and from the remote accelerator. The NAA software stack needs to be able to recognize and forward such requests accordingly.

Device Enumeration: The software environment should be able to emulate the PCIe device enumeration. After powering an accelerator device, it needs to be configured and mapped onto system resources to be ready for operation.

Memory-Mapped I/O: The cluster nodes should be able to map and forward access requests to the accelerators’ BAR windows, which are typically made visible to the system as MMIO regions in the main memory. This is needed for the configuration of and communication with the accelerator devices, but also to provide compatibility with PCI-specific system calls and functions.

MSI Configuration and Interrupt Delivery: During the accelerator device configuration, the software has to modify the MSI packet (destination address and payload) by writing the corresponding values to the MSI capability register set. The software layer has to be able to register the accelerator’s interrupt handler with the Extoll interrupt management, and it needs to distinguish between Extoll-based interrupts and interrupts issued by the accelerator.

Figure 4.11: Abstract software stack view. (Panels: Standard System and NAA Node. Layers shown: Applications, Runtime / Library, Linux Kernel with the Accelerator Device Driver, and Hardware; the NAA Node additionally shows the Virtual PCI module, the Extoll Driver, and the Extoll NICs in front of the accelerator.)

Figure 4.11 shows a generic overview of the NAA software environment in comparison to the default Linux software stack. On the left-hand side, the typical layers of a software stack supporting an accelerator device are displayed. The goal of the NAA software environment is to remove the locally installed accelerator device and integrate it into the Extoll network, if possible without any modification of the original application code. In principle, this goal can be achieved by manipulating any of the software layers on the left side. However, it is desirable to introduce as few modifications as possible.

The general idea of the NAA software approach is presented on the right-hand side of Figure 4.11. The idea is to redirect PCI support library calls to a “virtual” PCI layer, which is able to distinguish between local PCI requests and communication targeting the accelerator nodes, but also provides the means to fulfill the previously described tasks. The following sections explain the concepts needed to implement these tasks and illustrate how they can be mapped onto the Extoll technology.

4.5.2.1 PCIe Configuration Space Access

In the absence of a CPU and a root complex on the accelerator node, it is the task of a remote host system to configure the accelerator card’s PCIe host interface over the network. The Extoll NIC’s interface is designed to send and receive only PCIe memory request packets as an endpoint. To configure a PCIe hierarchy, the host interface must be able to send and receive PCIe configuration request packets.

For this purpose, the Extoll NIC features two functional units that enable the configuration and operation of remote PCIe buses: the RMA unit and the PCIe Bridge unit. In addition, the remote Extoll NIC needs to be configured to act as a root port.

4 Network-Attached Accelerators

First, a specialized functional unit is needed to inject PCIe configuration packets into the Extoll NIC’s outgoing host interface traffic. The PCIe bridge unit, introduced in section 3.2.7, resides in the on-chip network of Extoll’s network interface and can be configured by writing to the corresponding registers in the register file. The unit is accessible from every device over the network with Remote Register File Access (RRA) transactions, which are basically immediate PUT and GET operations to the remote registers. With this technique, a host can configure the remote PCIe bridge unit to forward incoming PCIe configuration packets to the accelerator device by writing to the PCIe backdoor register presented in Table 3.2. By writing PCIe configuration packets into the htoc_to_pcie_backdoor_data register in the remote register file via RRA transactions, they are inserted into the outgoing PCIe traffic stream from the root port to the accelerator.
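To make the injection path more concrete, the following sketch assembles the three header doublewords (DWs) of a PCIe Type 0 configuration request and hands them, one DW at a time, to a stand-in for the remote register file. The header field positions follow the generic PCIe transaction layer format. Everything else is an assumption made for illustration: the class `FakeRemoteRegisterFile`, its `put` method, the one-DW-per-write framing, and the choice of bus 1 for the accelerator are hypothetical and do not describe the actual Extoll register interface.

```python
FMT_3DW_NO_DATA = 0b000    # read request: header only
FMT_3DW_WITH_DATA = 0b010  # write request: header plus one data DW
TYPE_CFG0 = 0b00100        # Type 0 configuration request


def cfg_header(bus, dev, fn, reg, write, requester_id=0, tag=0):
    """Return the three header DWs of a Type 0 configuration request."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bus/device/function out of range")
    if reg % 4 or not 0 <= reg < 4096:
        raise ValueError("register offset must be DW-aligned and below 4 KiB")
    fmt = FMT_3DW_WITH_DATA if write else FMT_3DW_NO_DATA
    dw0 = (fmt << 29) | (TYPE_CFG0 << 24) | 1                        # Length = 1 DW
    dw1 = ((requester_id & 0xFFFF) << 16) | ((tag & 0xFF) << 8) | 0xF  # First DW BE = 1111b
    dw2 = (bus << 24) | (dev << 19) | (fn << 16) | reg               # reg fills bits 11:2
    return [dw0, dw1, dw2]


def cfg_write(bus, dev, fn, reg, value):
    """Header plus one payload DW (payload byte order on the wire is not modeled)."""
    return cfg_header(bus, dev, fn, reg, write=True) + [value & 0xFFFFFFFF]


class FakeRemoteRegisterFile:
    """Hypothetical stand-in for RRA access to a remote register file."""

    def __init__(self):
        self.log = []

    def put(self, register, value):
        self.log.append((register, value))


def inject(regfile, dwords):
    """Assumption: the backdoor data register accepts one DW per write."""
    for dw in dwords:
        regfile.put("htoc_to_pcie_backdoor_data", dw)


# Example: write 0x80000000 to BAR0 (offset 0x10) of device 01:00.0.
rf = FakeRemoteRegisterFile()
inject(rf, cfg_write(bus=1, dev=0, fn=0, reg=0x10, value=0x80000000))
```

The sketch only covers request construction; collecting the completion that the accelerator returns for each request would need a matching receive path, which is not modeled here.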

4.5.2.2 Device Enumeration

A simplified enumeration process can be used, since only one device resides on the root port of a remote accelerator node. The important values that need to be configured are the bus, device, and function numbers, the BARs, and the MSI capability registers. The bus, device, and function numbers identify the accelerator inside the PCIe hierarchy, whereas the MSI capability registers define the target address and the data that are sent when the accelerator issues an interrupt. The BARs define a memory window, which is required to enable the internal address translation for incoming request packets targeting the accelerator. This defines how a host can access and communicate with the accelerator.
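As an illustration of the values involved, the sketch below shows three small steps of a simplified enumeration: packing bus, device, and function numbers into the 16-bit identifier used in PCIe packets, deriving the size of a 32-bit memory BAR from the value read back after writing all ones to it (the standard PCI sizing procedure), and assigning naturally aligned base addresses. The function names and the example addresses are hypothetical, and 64-bit and I/O BARs are deliberately left out.

```python
def bdf_to_id(bus, dev, fn):
    """Pack bus (8 bit), device (5 bit), and function (3 bit) into a 16-bit ID."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (dev << 3) | fn


def bar_size(readback):
    """Size of a 32-bit memory BAR, given the value read after writing 0xFFFFFFFF."""
    mask = readback & 0xFFFFFFF0  # the low four bits hold type and prefetch flags
    if mask == 0:
        return 0                  # BAR is not implemented
    return ((~mask) & 0xFFFFFFFF) + 1


def assign_bars(sizes, base):
    """Assign each BAR a base address aligned to its own (power-of-two) size."""
    addr = base
    bases = []
    for size in sizes:
        addr = (addr + size - 1) & ~(size - 1)  # round up to the BAR's alignment
        bases.append(addr)
        addr += size
    return bases


# Example: a 1 MiB BAR followed by a 64 KiB BAR (illustrative values only).
example_sizes = [bar_size(0xFFF00000), bar_size(0xFFFF000C)]
example_bases = assign_bars(example_sizes, 0x80000000)
```

Each assigned base address would then be written back to the corresponding BAR register through a configuration write.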

Therefore, with the remote device’s configuration space being accessible transparently, all that needs to be done is to rescan the bus on which the device is to be placed.

4.5.2.3 Memory-Mapped I/O Regions

In general, peripheral components are accessed by using load and store operations to reserved address ranges in the PCIe configuration space and the memory-mapped I/O regions. The operations are mapped to an add-in card and translated into read and write requests. The easiest way to give a host access to the accelerator is to use these loads and stores to an MMIO region assigned to the accelerator. This has the additional benefit that the upper software and hardware layers can remain unchanged, but leads to the question of how to map physical addresses of the host memory to the accelerator’s PCIe address space.

The Extoll NIC’s SMFU can export segments of local memory to remote nodes to build a distributed shared memory system. Loads and stores from the CPU to these exported memory regions are encapsulated into network transactions to the remote node. At the remote node, the packets are translated back into host interface requests. PCIe packet ordering is ensured along the path from the CPU to the target accelerator and vice versa: the SMFU keeps the order of the packets it receives and forwards them to the host or the network interface in the same order, in a FIFO-like manner. With this memory mapping technique and ensured packet ordering, the MMIO ranges appear to be mapped into the local host’s main memory, but in reality this address range can be located anywhere in the network. The NAA software environment configures this memory mapping for every accelerator node in the system by writing the SMFU configuration to the node’s register file through remote register file accesses.

Figure 4.12: Memory mapping between a cluster node and two accelerators.

Figure 4.12 illustrates the memory mapping between a cluster node address space and two remote accelerator PCIe address spaces. The MMIO region assigned to the NIC is divided into several intervals. Each of these intervals is exclusively assigned to a functional unit. The range assigned to the SMFU is further subdivided into different intervals. Each of these intervals corresponds to a region of exported memory (BAR) with the size and location defined by a start (iStartAddr_n) and an end address. In addition, each interval has an interval ID and a Target ID. The Target ID specifies the Extoll node ID of the accelerator node the loads and stores are forwarded to, while the interval ID is used to match the source interval ID of these loads and stores. In this example, two BAR windows are mapped per accelerator node. address_x hits SMFU interval 0, which is translated to an address on AN0 in SMFU interval 0.

The SMFU on the accelerator node side adds an offset to incoming network-encapsulated request packets. The offset defines where the exported region is located in the accelerator’s memory space. If the calculated address matches the BAR assigned to the accelerator, a load or store to that address is sent to the accelerator. Note that the offset of an address within the SMFU interval is the same as the offset of the translated address within the accelerator’s BAR region. This technique allows the host CPU to directly access any accelerator connected over the Extoll network. Once the Extoll network is configured, each accelerator node has a unique node ID in the Extoll fabric. This node ID is used to address other accelerators within the system. As a consequence, two accelerator nodes belonging to different hosts are able to communicate with each other independently of the host systems.
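The two-stage translation can be sketched as a pair of lookups: the host-side SMFU matches a physical address against its intervals and yields the target node, the interval ID, and the offset into the interval; the accelerator-side SMFU adds that offset to the start address of the exported region and checks the result against the BAR. The data structures, field names, and all addresses below are hypothetical and chosen only to mirror the terms used in the text (iStartAddr, tStartAddr, interval ID, Target ID); they do not reflect the real SMFU register layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostInterval:
    interval_id: int
    start: int        # iStartAddr: first host physical address of the interval
    end: int          # last address of the interval (inclusive)
    target_node: str  # Extoll node ID of the accelerator node


@dataclass(frozen=True)
class TargetInterval:
    interval_id: int
    t_start: int      # tStartAddr: where the exported region starts on the target
    bar_base: int     # base address programmed into the accelerator's BAR
    bar_size: int


def host_lookup(intervals, addr):
    """Host side: return (target node, interval ID, offset) or None on a miss."""
    for iv in intervals:
        if iv.start <= addr <= iv.end:
            return iv.target_node, iv.interval_id, addr - iv.start
    return None


def target_translate(intervals, interval_id, offset):
    """Accelerator side: return the PCIe address, or None if it misses the BAR."""
    for iv in intervals:
        if iv.interval_id == interval_id:
            addr = iv.t_start + offset
            if iv.bar_base <= addr < iv.bar_base + iv.bar_size:
                return addr
            return None
    return None


# Illustrative configuration: two BAR windows on AN0, one on AN1.
host_intervals = [
    HostInterval(0, 0x1000_0000, 0x100F_FFFF, "AN0"),
    HostInterval(1, 0x1010_0000, 0x1010_FFFF, "AN0"),
    HostInterval(2, 0x1020_0000, 0x102F_FFFF, "AN1"),
]
an0_intervals = [
    TargetInterval(0, 0x8000_0000, 0x8000_0000, 0x10_0000),
    TargetInterval(1, 0x8010_0000, 0x8010_0000, 0x1_0000),
]

node, interval_id, offset = host_lookup(host_intervals, 0x1000_0040)
accel_addr = target_translate(an0_intervals, interval_id, offset)
```

In this example the offset 0x40 within host interval 0 reappears as offset 0x40 within the first BAR of AN0, which is the property the text describes.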

4.5.2.4 MSI Configuration and Interrupt Delivery

Most PCIe devices require interrupts to fulfill their function and to communicate events between the device and the driver. PCIe devices implement the Message Signaled Interrupt (MSI) mechanism, which sends a posted write packet towards an Advanced Programmable Interrupt Controller (APIC). A write to this controller triggers an interrupt, and the operating system forwards the interrupt to the corresponding device interrupt handler. In the network-attached accelerator architecture, there is no APIC register on the accelerator node and no direct connection between the accelerator’s PCIe bus and the host system’s APICs. To provide interrupt functionality, some special adjustments have to be made.

The interrupt management for accelerator devices is implemented by extending the Extoll interrupt handling on the host system to handle accelerator interrupts as well. The address of an MSI packet is stored in a PCIe configuration space register, which means that it can be modified from the remote host through configuration packets. The host sends configuration packets to the remote PCIe bus, which modify the MSI capability registers on the accelerator card. The accelerator’s MSI address is set such that the MSI write is forwarded to the host, where it hits a special address region. Extoll maps this region to an MSI packet of the host system, which targets a valid APIC with a registered interrupt handler. Based on the payload data carried by the MSI packet, the interrupt handler can distinguish between Extoll and accelerator interrupts. Figure 4.13 displays the flow of an interrupt triggered by an accelerator node. The interrupt is forwarded over the Extoll network to the NIC of the target host, which in turn issues an Extoll interrupt.

Figure 4.13: Interrupt handling within the booster.
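The two software steps described in this section can be sketched as follows. The first function lists the configuration-space writes that set the message address and message data of a standard MSI capability structure (address at capability offset 0x04, upper address at 0x08 for 64-bit capable devices, data at 0x0C or 0x08 respectively). The second part models an interrupt handler that uses the MSI payload to decide whether an interrupt belongs to Extoll or to an accelerator. The capability offset, the address, the payload values, and the dispatcher class are assumptions made for illustration; enabling MSI through the Message Control register is omitted.

```python
def msi_config_writes(cap_offset, msg_addr, msg_data, is_64bit):
    """Return (config register offset, value) pairs that program an MSI capability."""
    if cap_offset % 4:
        raise ValueError("capability offset must be DW-aligned")
    writes = [(cap_offset + 0x04, msg_addr & 0xFFFFFFFF)]  # Message Address (low)
    if is_64bit:
        writes.append((cap_offset + 0x08, (msg_addr >> 32) & 0xFFFFFFFF))  # upper address
        writes.append((cap_offset + 0x0C, msg_data & 0xFFFF))              # Message Data
    else:
        writes.append((cap_offset + 0x08, msg_data & 0xFFFF))              # Message Data
    return writes


class InterruptDispatcher:
    """Hypothetical model: route interrupts by the payload of the MSI packet."""

    def __init__(self, extoll_handler):
        self._extoll_handler = extoll_handler
        self._accelerator_handlers = {}

    def register_accelerator(self, payload, handler):
        """Associate an MSI payload value with an accelerator driver's handler."""
        self._accelerator_handlers[payload] = handler

    def handle(self, payload):
        """Call the accelerator handler for a known payload, else the Extoll handler."""
        handler = self._accelerator_handlers.get(payload, self._extoll_handler)
        return handler(payload)


# Example with made-up values: capability at 0x50, payload 0x41 marks accelerator AN0.
writes = msi_config_writes(0x50, 0x1_2345_6000, 0x41, is_64bit=True)

dispatcher = InterruptDispatcher(lambda payload: ("extoll", payload))
dispatcher.register_accelerator(0x41, lambda payload: ("accelerator AN0", payload))
```

Each pair returned by `msi_config_writes` would be sent to the accelerator as a configuration write request over the network, so that the accelerator’s MSI write ends up in the address region that Extoll maps to a host interrupt.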