The improvement and optimization of direct communication between accelerator devices is an active research topic. The following sections present a selection of communication-related research efforts for the Intel Xeon Phi and for GPUs, followed by an overview of hardware-related research.
4.4.1 Intel Xeon Phi Coprocessor-based Communication Models
DCFA The direct communication facility for many-core based accelerators (DCFA) [89] targets the implementation of direct data transfers for many-core architectures and utilizes the InfiniBand technology. The internal structures of an InfiniBand HCA are mapped to memory areas of the host and the MIC. The MIC reads or writes data by directly accessing the memory areas of a remote host or remote MIC. MPI communication primitives executed on the MIC transfer data by issuing commands to the HCA. The implementation is based on the Mellanox InfiniBand (IB) driver. DCFA-MPI [90] provides an MPI implementation based on the DCFA framework for direct peer-to-peer communication between remote Intel MICs.
MVAPICH2-MIC MVAPICH2-MIC [91, 92] is a proxy-based communication framework using InfiniBand and the symmetric communication interface (SCIF) [87]. Its goal is to optimize the set of communication paths that are possible in the symmetric mode. The framework provides three different communication designs. DirectIB provides direct MPI communication through the Coprocessor Communication Link (CCL) driver of the MPSS software stack. The passive proxy handles host-staged communication by setting up staging buffers, but is not directly involved in communication between a MIC and a remote MIC. It utilizes the RDMA capability of SCIF. The active proxy utilizes a dedicated processor core on the host and comes in two variants. The one-hop variant initiates and progresses communication that is staged through the host; SCIF transfers are initiated on the host. The two-hop variant tries to utilize the high-bandwidth channels of the MIC: a MIC-to-remote-MIC transfer is staged first to the local host, then to the remote host, and finally to the remote MIC.
HAM and HAM-Offload HAM [93] is implemented as a C++ template library to create type-safe heterogeneous active messages (HAM). Active messages can contain or reference code that should be run upon receipt. The main question is how to translate between the handler addresses of heterogeneous binaries at minimal cost. HAM solves the problem by adding a level of indirection that is implemented in pure C++ without any language extensions. The message handler registry acts like a map between keys and handler addresses. Each process has the same set of keys, but the handler addresses differ between the individually compiled binaries for the different instruction set architectures. The HAM-Offload API [94] complements the functionality of HAM with a unified intra- and inter-node offload API. It provides primitives similar to those of other offload programming models. The framework is compiler independent and provides communication backends for MPI and SCIF.
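The level of indirection provided by the message handler registry can be illustrated with a minimal Python sketch. It is not HAM's C++ template implementation: the class `HandlerRegistry`, the handler names, the numeric addresses, and the key scheme (position in the name-sorted handler list) are assumptions chosen only to show how identical keys can resolve to different addresses in different binaries.

```python
# Illustrative model of a message handler registry (not HAM's actual C++ API).
# Two "binaries" hold the same handlers at different addresses; a message
# carries a key that is valid in every binary instead of a raw address.

class HandlerRegistry:
    """Maps binary-independent keys to binary-specific handler addresses."""

    def __init__(self, handlers_by_address):
        # handlers_by_address: {local_address: (name, callable)}
        # Assumption: sorting by handler name yields the same key order in
        # every binary, because all binaries contain the same set of handlers.
        ordered = sorted(handlers_by_address.items(), key=lambda kv: kv[1][0])
        self._address_of_key = [addr for addr, _ in ordered]
        self._key_of_address = {
            addr: key for key, addr in enumerate(self._address_of_key)
        }
        self._code = {addr: fn for addr, (_, fn) in handlers_by_address.items()}

    def key_for(self, address):
        """Sender side: translate a local handler address into a key."""
        return self._key_of_address[address]

    def dispatch(self, key, *args):
        """Receiver side: translate the key into a local address and call it."""
        address = self._address_of_key[key]
        return self._code[address](*args)


def add(a, b):
    return a + b


def scale(x, factor):
    return x * factor


# Hypothetical addresses: the same handlers live at different addresses
# in the host binary and in the coprocessor binary.
host = HandlerRegistry({0x4000: ("add", add), 0x4100: ("scale", scale)})
mic = HandlerRegistry({0x9A00: ("scale", scale), 0x9B80: ("add", add)})

key = host.key_for(0x4100)          # the host wants to run "scale" remotely
result = mic.dispatch(key, 21, 2)   # the coprocessor resolves the key locally
```

Only the key travels with the message, so neither side needs to know the other side's address layout. Each translation is a single table lookup.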
4.4.2 GPU Virtualization and Communication Techniques
rCUDA rCUDA [95], or remote CUDA, is a framework for remote GPU virtualization in cluster environments. It is fully compatible with the CUDA runtime and transparently allocates one or more (local or remote) CUDA-enabled GPUs to a single application. rCUDA implements a client-server architecture. On the client side, the rCUDA wrapper library intercepts calls to the CUDA runtime and forwards them to the server side. The rCUDA server daemon, which runs on each node offering acceleration services, receives the forwarded requests and executes the CUDA kernels. The software overhead reduces the performance in comparison to pass-through technologies.
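The interception pattern described above can be sketched in a few lines of Python. This is a simplified model and not rCUDA's code or wire protocol: the classes `GpuServer` and `ClientWrapper`, the call names, the JSON encoding, and the in-process "transport" are all assumptions made for illustration, and the "kernel" is an ordinary Python loop.

```python
# Illustrative client-server interception model (not rCUDA's implementation).
import json


class GpuServer:
    """Stands in for the server daemon that owns the accelerator."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 1

    def handle(self, request_bytes):
        """Decode one forwarded call, execute it, and encode the result."""
        request = json.loads(request_bytes)
        result = getattr(self, "_do_" + request["call"])(*request["args"])
        return json.dumps({"result": result}).encode()

    def _do_malloc(self, count):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = [0.0] * count
        return handle

    def _do_memcpy_to_device(self, handle, values):
        self._buffers[handle][:len(values)] = values
        return len(values)

    def _do_launch_scale(self, handle, factor):
        # Placeholder for launching a kernel on the device.
        self._buffers[handle] = [v * factor for v in self._buffers[handle]]

    def _do_memcpy_to_host(self, handle):
        return list(self._buffers[handle])


class ClientWrapper:
    """Offers the runtime calls locally and forwards each one to the server."""

    def __init__(self, transport):
        self._transport = transport   # callable: request bytes -> reply bytes
        self.forwarded_calls = 0

    def __getattr__(self, call_name):
        # Only reached for names that are not regular attributes, i.e. for
        # the runtime calls the application believes it is making locally.
        def forward(*args):
            self.forwarded_calls += 1
            request = json.dumps({"call": call_name, "args": list(args)})
            return json.loads(self._transport(request.encode()))["result"]
        return forward


server = GpuServer()
runtime = ClientWrapper(server.handle)   # in-process stand-in for the network

buf = runtime.malloc(3)
runtime.memcpy_to_device(buf, [1.0, 2.0, 3.0])
runtime.launch_scale(buf, 2.0)
output = runtime.memcpy_to_host(buf)
```

Every call made by the application is serialized, transferred, and deserialized before any work happens on the device. This is the software overhead mentioned above.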
VirtualCL VirtualCL (VCL) [96] is very similar to rCUDA in its basic concepts. It provides transparent access to accelerator devices on remote nodes and allows applications to utilize accelerators on different nodes without requiring the application to explicitly split its computations between these nodes. The application itself runs on a single node only, and VCL executes OpenCL kernels on other nodes when necessary. VCL consists of three components: the VCL library, the broker, and a back-end daemon. The VCL library implements OpenCL and transparently accesses OpenCL devices on the cluster. The broker is a daemon running on each host system. It monitors the availability of OpenCL devices and allocates available devices to clients. It also manages communication between applications and back-ends. The back-end daemon runs on every node that contains usable accelerator devices. It uses vendor-specific OpenCL libraries to run OpenCL kernels on its devices when requested by a client.
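The bookkeeping role of the broker can be pictured with the following Python sketch. It is a strongly simplified model and not VCL's implementation: a single `Broker` object replaces the per-host daemons, and the method names `register_backend`, `allocate`, and `release` are hypothetical.

```python
# Illustrative broker model (not VCL's actual implementation).

class Broker:
    """Tracks devices reported by back-ends and hands free ones to clients."""

    def __init__(self):
        self._free = []     # (node, device_id) pairs that are not in use
        self._owner = {}    # (node, device_id) -> name of the owning client

    def register_backend(self, node, device_ids):
        """Called when a back-end daemon reports its usable devices."""
        for device_id in device_ids:
            self._free.append((node, device_id))

    def allocate(self, client, count):
        """Give up to `count` free devices to `client`, local or remote."""
        granted = self._free[:count]
        self._free = self._free[count:]
        for device in granted:
            self._owner[device] = client
        return granted

    def release(self, client):
        """Return all devices held by `client` to the free pool."""
        held = [d for d, owner in self._owner.items() if owner == client]
        for device in held:
            del self._owner[device]
            self._free.append(device)
        return held


broker = Broker()
broker.register_backend("node1", [0, 1])
broker.register_backend("node2", [0])

# The application runs on one node but receives devices from both nodes.
devices = broker.allocate("app", 3)
leftover = broker.allocate("other", 1)   # nothing is free at this point
released = broker.release("app")
```

From the application's point of view, the three devices in `devices` look like one local device list, although they reside on two different nodes.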
NVIDIA Grid Front-end virtualization, as implemented by frameworks like rCUDA or VCL, offers great flexibility and reduces the required hardware while increasing the utilization of available devices, but comes with two major disadvantages. First, there is an overhead due to the software involved on the client and server side. Second, these frameworks are completely tied to a specific API, like CUDA or OpenCL. To avoid these problems, the NVIDIA Grid technology [97] offers back-end virtualization for virtual machines. The NVIDIA Kepler GPU design implements a memory management unit (MMU) and dedicated input buffers for each virtual machine. This way, different virtual machines can simultaneously use a single GPU without interfering with each other and without the software overhead of API interception. NVIDIA Grid offers high performance without penalizing multiplexing. However, special hardware support is required, which restricts this technique to the limited number of devices that can be configured to support this functionality.
GPUDirect RDMA GPUDirect RDMA [98] is a feature of the CUDA runtime environment and was first introduced in Kepler-class GPUs with CUDA 5.0. This technique provides a direct peer-to-peer data path between two GPUs by mapping the GPU memory to one of the GPU’s BARs. Other peripheral devices can use this physical address to communicate directly with the GPU. Remote RDMA communication is improved by removing unnecessary memory copies between GPU memory and host memory. However, the ratio between CPUs and GPUs is still fixed.
Global GPU Address Spaces The Global GPU Address Spaces (GGAS) [99] concept facilitates direct communication between distributed GPUs while bypassing the CPUs for all communication and computational tasks. GGAS relies on thread-collective communication and therefore maintains the GPUs' bulk-synchronous, massively parallel programming model. Furthermore, GGAS utilizes a zero-copy technique for data movement between distributed GPU memories. This technique relies on overlapping shared GPU memory segments with the SMFU address space of the Extoll NIC, which builds a distributed shared, and therefore global, GPU address space. The major limitation of this approach is that a GPU device needs to support GPUDirect RDMA in order to span the global address space.
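The notion of a global address space built from per-GPU memory segments can be made concrete with a small Python model. The layout used here, equally sized segments concatenated in GPU order, is an assumption for illustration; the text above does not specify how the SMFU actually maps addresses, and the class `GlobalAddressSpace` and its methods are hypothetical.

```python
# Illustrative model of a global address space spanning several GPU segments.
# Assumption: equally sized segments, concatenated in GPU order.

class GlobalAddressSpace:
    def __init__(self, num_gpus, segment_size):
        self.segment_size = segment_size
        # One byte buffer per GPU stands in for its shared memory segment.
        self.segments = [bytearray(segment_size) for _ in range(num_gpus)]

    def global_address(self, gpu, offset):
        """Combine a GPU index and a local offset into a global address."""
        if not 0 <= gpu < len(self.segments):
            raise IndexError("unknown GPU")
        if not 0 <= offset < self.segment_size:
            raise ValueError("offset outside the segment")
        return gpu * self.segment_size + offset

    def locate(self, global_address):
        """Split a global address into (GPU index, local offset)."""
        gpu, offset = divmod(global_address, self.segment_size)
        if not 0 <= gpu < len(self.segments):
            raise IndexError("address outside the global address space")
        return gpu, offset

    def store(self, global_address, data):
        """Write directly into the owning segment, without a staging copy."""
        gpu, offset = self.locate(global_address)
        if offset + len(data) > self.segment_size:
            raise ValueError("write crosses the segment boundary")
        self.segments[gpu][offset:offset + len(data)] = data

    def load(self, global_address, length):
        gpu, offset = self.locate(global_address)
        return bytes(self.segments[gpu][offset:offset + length])


space = GlobalAddressSpace(num_gpus=4, segment_size=1024)
address = space.global_address(gpu=2, offset=16)
space.store(address, b"halo")        # lands in the segment of GPU 2
owner = space.locate(address)
data = space.load(address, 4)
```

A store to a global address needs no explicit send or receive call. The address alone determines which GPU's memory is written, which is the property that allows the programming model to remain unchanged.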
4.4.3 Hardware-related Research
The idea of a cluster of accelerators was first discussed in the context of the QPACE supercomputer prototype [100]. QPACE was a massively parallel quantum chromodynamics prototype enhanced by Cell BE processors. In 2010, QPACE was ranked #1 on the Green500 list [101]. Other approaches tend to use PCIe (Peripheral Component Interconnect Express) as the interconnection network between accelerators. Non-Transparent Bridges (NTB) [102, 103] connect independent PCIe hierarchies of different nodes. Communication between accelerators relies on address translation and table-based addressing schemes inside the NTB. This introduces additional management overhead and does not scale very well. In addition, PCIe lacks some interconnection network features. Advanced Switching Interconnect (ASI) [104] tries to extend the PCIe protocol to support features such as protocol tunneling, routing, and congestion management. The goal of independent scalability of hosts and accelerators is not achieved with these approaches. Recently, the NVIDIA NVSwitch technology [105] has been introduced together with the NVIDIA DGX-2 system [106]. NVSwitch is an on-node switch with 18 NVLink ports per switch. Internally, it is a fully connected crossbar. The NVIDIA DGX-2 utilizes the NVSwitch technology to connect 16 NVIDIA Tesla V100 GPUs to one host system.
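The table-based address translation performed inside an NTB, mentioned in the paragraph above, can be sketched as follows. The sketch is a generic model and does not describe a particular NTB device: the `Window` fields, the node names, and all addresses are hypothetical.

```python
# Illustrative model of table-based address translation in a bridge.
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    local_base: int    # start of the window in the local address space
    size: int          # window length in bytes
    remote_node: str   # PCIe hierarchy reached through this window
    remote_base: int   # start of the target region in the remote hierarchy


class TranslationTable:
    def __init__(self, windows):
        self._windows = list(windows)

    def translate(self, local_address):
        """Map a local address to (remote node, remote address)."""
        for window in self._windows:
            offset = local_address - window.local_base
            if 0 <= offset < window.size:
                return window.remote_node, window.remote_base + offset
        raise LookupError("address is not covered by any window")

    def __len__(self):
        return len(self._windows)


# One window per reachable peer: the table grows with the number of nodes.
table = TranslationTable([
    Window(0x1000_0000, 0x1000, "nodeB", 0x8000_0000),
    Window(0x1000_1000, 0x1000, "nodeC", 0x9000_0000),
])

target = table.translate(0x1000_0010)
other = table.translate(0x1000_1FFF)
```

Because every reachable peer requires its own window, and every node has to maintain such a table, the management effort increases with the size of the system. This is consistent with the scalability concern stated above.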