4.6 DEEP Booster Architecture
4.6.2 Prototype Implementation
The DEEP booster is the first prototype implementation of the NAA communication model. The booster architecture consists of booster node cards (BNC) that are connected by the Extoll interconnect, and booster interface cards (BIC). The following sections briefly describe the implementation of the BIC and BNC, followed by an introduction to the booster low-level software stack. The DEEP project has chosen the Many Integrated Core (MIC) architecture as the target accelerator technology. The innovative features of the Extoll NIC, e.g., the PCIe root port and the SMFU, allow the MIC to be operated on the NIC without a host. One of the main advantages of using the Intel MIC technology is that the existing Extoll kernel modules can be used on the accelerator cards to source and sink network traffic.
4.6.2.1 Hardware Implementation
The counterpart of the BI is the BIC. The Extoll NIC has a memory-mapped region of 16 · 8GB and a region of 16 · 128KB to manage and address up to 16 accelerator cards. The SMFU maintains two intervals per accelerator to map the complete MMIO configuration register and the MemBAR GDDR5 memory of the BN’s PCIe bus into the BIC’s address space. Four link ports of the NIC are used to connect the BIC to the 3D torus of the booster. The BIC is responsible for booting and controlling the MICs over the Extoll interconnect.
[Figure 4.16 panels: Standard System; Booster Interface Node. Labels shown: Applications, Runtime / Library, Linux Kernel, Intel MPSS Driver Stack, PCI Surrogate, MIC-to-Extoll, Extoll Driver, Hardware (Intel MIC, Extoll NIC).]
Figure 4.16: BIC software stack.
The BNC is a high-density implementation with two independent BNs, which are part of the 3D torus. A BN consists of a NIC and an accelerator, the Intel MIC. Eight BNCs are connected to a single backplane that provides connections to neighboring backplanes and their BNCs.
As described in figure 4.12, the SMFU can be used to map regions of BI memory to BNs. On the BIC, two intervals are exported to the same BN. One exported region points to the BAR of the MMIO configuration register and the other region points to the BAR of the MemBAR GDDR5 memory. The MMIO registers of the Intel MIC are placed below 4GB, and the MemBAR is placed above 256GB so that it does not interfere with the BIC’s operating system memory map. This configuration is maintained for all BNCs. The communication between a BNC and a CN does not require any CPU interaction on the BIC.
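The interval layout described in this section can be illustrated with a small C sketch. It assumes that each of the two regions is divided into 16 contiguous, equally sized windows indexed by accelerator number, and it interprets 8GB and 128KB as powers of two. All identifiers (`smfu_window`, `smfu_owner`, the macros) are hypothetical and are not taken from the Extoll driver; the base addresses are left to the caller because the text only states that the MMIO registers lie below 4GB and the MemBAR above 256GB.

```c
#include <assert.h>
#include <stdint.h>

#define SMFU_MAX_ACCELERATORS 16u
#define SMFU_MEMBAR_WINDOW (8ull << 30)   /* 8GB per accelerator, read as 2^33 bytes */
#define SMFU_MMIO_WINDOW   (128ull << 10) /* 128KB per accelerator, read as 2^17 bytes */

/* Half-open address interval [start, end). */
struct smfu_interval {
    uint64_t start;
    uint64_t end;
};

/* Returns the window of accelerator `idx` inside a region starting at `base`.
 * Contiguous, equally sized windows are an assumption of this sketch. */
struct smfu_interval smfu_window(uint64_t base, uint64_t window, unsigned idx)
{
    assert(idx < SMFU_MAX_ACCELERATORS);
    struct smfu_interval iv;
    iv.start = base + (uint64_t)idx * window;
    iv.end = iv.start + window;
    return iv;
}

/* Maps an address back to the accelerator index that owns it,
 * or returns -1 if the address lies outside the region. */
int smfu_owner(uint64_t base, uint64_t window, uint64_t addr)
{
    if (addr < base || addr >= base + SMFU_MAX_ACCELERATORS * window)
        return -1;
    return (int)((addr - base) / window);
}
```

With a MemBAR region based at 256GB, for example, `smfu_window(256ull << 30, SMFU_MEMBAR_WINDOW, 3)` yields the window of the fourth accelerator, and `smfu_owner` performs the reverse lookup for any address inside it.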
4.6.2.2 Software Implementation
The NAA Booster software stack implements all of the aforementioned requirements and consists of a collection of configuration scripts and a kernel module, which maps PCI support library calls and device structures onto the Extoll software stack resources. The main advantage of this approach is the transparency to upper software layers. Once the Intel MIC driver is running on top of the virtual PCI software stack, all existing applications can be used without any modification.
The configuration scripts are used to set up the SMFU interval mapping and to configure the PCIe bridge unit to forward PCIe configuration packets to the remote accelerators and vice versa. The scripts perform RRA reads and writes to the remote Extoll NICs by utilizing the user library libRMA. The kernel module intercepts the Intel MIC driver when it initializes the PCIe device and is responsible for maintaining and mapping the necessary device structures onto the Extoll SMFU memory-mapped I/O regions. Figure 4.16 displays the software stack loaded on a BIC. The virtual PCI software layer comprises two components:
PCI surrogate layer: The PCI surrogate layer replaces the PCI support library. All PCI function calls are redirected to this layer.
MIC-to-Extoll layer: PCI structures needed by the Intel MIC driver are mapped to the Extoll data structures for device initialization.
To be able to use the BIC software stack, the Intel MPSS is recompiled against the PCI surrogate layer header file, which is built on top of the Extoll kernel API.
Device Configuration and Resource Mapping
When the Intel MIC driver is loaded on a BIC, the driver initializes the connected MIC devices. Typically, the driver registers itself with the PCI subsystem at module startup. This registration call is intercepted by redirecting it to the PCI surrogate layer, where a customized hardware initialization function is called. The redirection is achieved by replacing the PCI header file include with #include "pci_surrogate.h". Depending on the number of connected MICs, an array of MIC descriptor structures is initialized with the start and end addresses of the MemBAR and MMIO regions. These values are needed for booting the accelerators, since the PCIe client logic registers, referred to as SBOX registers, are accessed through the MMIO regions and the Linux image is copied into the remapped MemBAR region. Instead of allocating and mapping memory regions for each MIC, the memory regions are overlapped with the SMFU’s MMIO space, which is subdivided into several intervals.
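The interception and descriptor setup described here can be sketched in user-space C. This is an illustration only: the real layer is a kernel module built on the Extoll kernel API, whose interfaces are not reproduced. Every name below (`struct mic_desc`, `struct fake_pci_driver`, `surrogate_register_driver`, `demo_probe`) is hypothetical, the region bases and sizes are supplied by the caller, and end addresses are treated as exclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MICS 16

/* Hypothetical per-MIC descriptor holding [start, end) of both regions. */
struct mic_desc {
    uint64_t mmio_start, mmio_end;     /* SBOX registers are reached through MMIO */
    uint64_t membar_start, membar_end; /* the Linux image is copied into MemBAR */
};

struct mic_desc mic_table[MAX_MICS];

/* Stand-in for a driver structure; a real PCI driver has many more fields. */
struct fake_pci_driver {
    const char *name;
    int (*probe)(const struct mic_desc *mic, int index);
};

/* Surrogate for the registration call: instead of asking a PCI subsystem to
 * enumerate devices, fill one descriptor per MIC from the SMFU-backed regions
 * and hand it to the driver's probe function.
 * Returns the number of MICs set up, -1 on bad arguments, or the first
 * nonzero probe result. */
int surrogate_register_driver(struct fake_pci_driver *drv, int num_mics,
                              uint64_t mmio_base, uint64_t mmio_size,
                              uint64_t membar_base, uint64_t membar_size)
{
    if (drv == NULL || drv->probe == NULL || num_mics < 0 || num_mics > MAX_MICS)
        return -1;
    assert(mmio_size > 0 && membar_size > 0);

    for (int i = 0; i < num_mics; i++) {
        struct mic_desc *d = &mic_table[i];
        d->mmio_start = mmio_base + (uint64_t)i * mmio_size;
        d->mmio_end = d->mmio_start + mmio_size;
        d->membar_start = membar_base + (uint64_t)i * membar_size;
        d->membar_end = d->membar_start + membar_size;

        int rc = drv->probe(d, i);
        if (rc != 0)
            return rc;
    }
    return num_mics;
}

/* Example probe that only counts how many devices it was offered. */
int demo_probe_calls = 0;

int demo_probe(const struct mic_desc *mic, int index)
{
    (void)mic;
    (void)index;
    demo_probe_calls++;
    return 0;
}
```

A driver built against such a header would call the surrogate in place of the usual registration function, for instance through a macro defined in the replaced header, so that the driver source itself needs no further changes.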
MSI Configuration and Interrupt Forwarding
The second stage of the device initialization process is the MSI configuration of the MICs. This is done by writing directly to the corresponding SBOX registers, which reside in the mapped SMFU memory regions in the address space of the BIC. The MSI vector is composed of a predefined address, a message, and the vector control. The predefined address enables the interrupt redirection to the BIC. The last step of the initialization process is the registration of the MIC interrupt handler with the Extoll interrupt handling subsystem. This is done by keeping a function pointer to the corresponding interrupt handler within the Extoll interrupt management structure.
The Extoll NIC has several functional units that are able to trigger an interrupt. The possible interrupts are divided into different trigger event groups. The Extoll driver manages possible interrupt sources in an array of function pointers, which are identified by unique tags. To handle interrupts issued by the MIC, the Extoll interrupt mechanism is extended by an additional event group and a flag that indicates whether MICs are present in the running system. When a hardware interrupt occurs on the Extoll card, the interrupt handler is called. The driver identifies the MIC’s interrupt by its event group and redirects the interrupt by calling the corresponding function pointer.
4.6.2.3 Communication Paths
One of the most important advantages of this architecture is the direct accelerator-to-accelerator communication between BNs. All accelerators in the communication architecture are directly connected to the network. As a consequence, the number of accelerators scales independently of the number of hosts. Another key feature is that an Intel MIC runs autonomously with its own Linux operating system after device setup. All Extoll low-level kernel modules, as well as the user space libraries, have been ported to the embedded operating system, which provides full access to all functional units of the Extoll NIC. With these functional units, one-sided communication between accelerators is supported by RMA PUT and GET operations, which transfer large chunks of data. For small messages, the VELO unit supports MPI send and receive operations with very low-latency two-sided communication. The SMFU can be used to distribute parts of an accelerator’s local memory to multiple accelerators over the network and supports loads and stores to these memory regions.
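The division of labor between the two units can be expressed as a simple size-based choice. The following C sketch is illustrative only: the names `choose_unit`, `UNIT_VELO`, and `UNIT_RMA` are hypothetical, and the cut-over size is a caller-supplied parameter because the text does not state where the boundary between small and large messages lies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the two Extoll units named in the text. */
enum extoll_unit {
    UNIT_VELO, /* two-sided, low-latency send/receive for small messages */
    UNIT_RMA   /* one-sided PUT/GET for large chunks of data */
};

/* Picks a unit by message size. `small_msg_limit` is an assumption of this
 * sketch, not a documented Extoll constant. */
enum extoll_unit choose_unit(size_t msg_bytes, size_t small_msg_limit)
{
    assert(small_msg_limit > 0);
    return msg_bytes <= small_msg_limit ? UNIT_VELO : UNIT_RMA;
}
```

A messaging layer could call such a function once per transfer and then issue either a send over VELO or a PUT over RMA.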
The Extoll NIC provides the features necessary to build a scalable interconnection network. Figure 4.17 shows the possible communication paths within the DEEP architecture. Path (A) shows the accelerator-to-accelerator direct communication between two Intel MICs. During the boot and configuration process, path (B) is used for the OS image download, configuration, and status information. After the completion of this process, the Intel MICs can directly communicate with any other MIC in the system over path (A), receive workloads from the cluster or send results back over path (C). The coprocessor-only model for MPI applications strongly benefits from the communication path (A). All accelerators within the booster can be used to run parallel applications independently from any host.
[Figure 4.17 labels: BIC0 ... BICM (each with CPU, Memory, Root Complex, Extoll NIC, IB HCA); BN0 ... BNN (each with Intel MIC, Extoll NIC); Extoll Network; paths (A), (B), (C).]
Figure 4.17: Communication paths.
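The roles of the three paths in Figure 4.17 can be summarized in a short C sketch. The enum and function names are hypothetical. The text does not say whether path (B) stays in use once a MIC has booted; the sketch assumes that it remains available for status information.

```c
#include <assert.h>
#include <stdbool.h>

/* Kind of peer an Intel MIC on a BN exchanges data with. */
enum endpoint { EP_BN, EP_BIC, EP_CLUSTER };

/* Paths as labeled in Figure 4.17, plus a marker for "not available". */
enum comm_path { PATH_A, PATH_B, PATH_C, PATH_NONE };

/* Selects the communication path for a MIC, depending on whether its boot
 * and configuration process has completed. */
enum comm_path select_path(bool mic_booted, enum endpoint peer)
{
    assert(peer == EP_BN || peer == EP_BIC || peer == EP_CLUSTER);

    if (!mic_booted) {
        /* Boot phase: only the BIC talks to the MIC (image, configuration, status). */
        return peer == EP_BIC ? PATH_B : PATH_NONE;
    }

    switch (peer) {
    case EP_BN:
        return PATH_A; /* direct accelerator-to-accelerator communication */
    case EP_BIC:
        return PATH_B; /* assumption: still usable after boot */
    case EP_CLUSTER:
        return PATH_C; /* receive workloads, send results back */
    }
    return PATH_NONE;
}
```

In the coprocessor-only model, every transfer between booted MICs would resolve to `PATH_A`, which matches the observation that such applications run independently of any host.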