4.8 NAA Summary
5.2.2 Sockets-like Interfaces
GMSOCKS GMSOCKS [147] is a direct sockets implementation that maps standard Windows socket calls onto the Myrinet/GM device driver without the need to modify the source code, or relink or recompile the application. GM is a message passing system for Myrinet networks and includes a driver, a Myrinet-interface control program, a network mapping program, and the GM API, library and header files. GMSOCKS is implemented as a thin user space software layer in between the Winsock Direct architecture and the GM user library. The Windows socket calls are intercepted with the Detours runtime library [148], which provides dynamic interception of arbitrary Win32 binary functions at runtime on x86 machines.
Depending on the Winsock version, GMSOCKS utilizes the buffered copy or write zero-copy mode for data transfers. GMSOCKS uses so-called companion sockets to retrieve information about socket descriptors and match them with the corresponding GM information. Since GM provides only one receive queue, a worker thread dispatches incoming messages and inserts them into the corresponding receive queues of established point-to-point connections. GMSOCKS provides full semantics and is fully functional against the TCP/IP implementation.
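The dispatching step described above can be illustrated with a minimal sketch. It is not GMSOCKS code: the queue objects, the connection identifiers, and the function name `dispatch` are hypothetical and only show how one worker thread can drain a single shared receive queue into per-connection queues.

```python
import queue
import threading

def dispatch(shared_rx, per_connection, stop):
    """Drain the single shared receive queue into per-connection queues.

    Hypothetical model: each item is a (connection_id, payload) tuple.
    """
    while True:
        item = shared_rx.get()
        if item is stop:          # sentinel object ends the worker
            break
        conn_id, payload = item
        per_connection[conn_id].put(payload)

# One shared queue (GM offers only one) and two established connections.
shared_rx = queue.Queue()
connections = {1: queue.Queue(), 2: queue.Queue()}
STOP = object()

worker = threading.Thread(target=dispatch, args=(shared_rx, connections, STOP))
worker.start()
for message in [(1, b"a"), (2, b"b"), (1, b"c")]:
    shared_rx.put(message)
shared_rx.put(STOP)
worker.join()
```

After the worker has finished, each connection queue holds only its own messages, in arrival order.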
Mellanox’s Messaging Accelerator Mellanox’s Messaging Accelerator (VMA) [149], formerly known as Voltaire Messaging Accelerator, is an open source project. It comes as a dynamically linked user-space library that builds on the native RDMA verbs API. The VMA library does not require any code changes or recompilation of user applications. Instead, it is dynamically loaded via the Linux environment variable LD_PRELOAD and intercepts the socket receive and send calls made to the stream socket or datagram socket address families. The VMA library bypasses the operating system by implementing the underlying work in user space.
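As a sketch of how such preloading is typically set up, the following helper prepares a process environment in which the dynamic linker loads the interception library first. The helper name `with_preload` and the application path in the comment are hypothetical; `libvma.so` is assumed to be resolvable by the dynamic linker on the target system.

```python
def with_preload(env, library):
    """Return a copy of env in which `library` is listed first in LD_PRELOAD."""
    new_env = dict(env)
    existing = new_env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list of shared objects.
    new_env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return new_env

env = with_preload({"PATH": "/usr/bin"}, "libvma.so")

# Hypothetical usage: launch an unmodified sockets application with it.
# import subprocess
# subprocess.run(["./my_socket_app"], env=env)
```

The application binary itself stays untouched; only the environment of the launched process changes.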
uStream uStream [150] is a user-level stream protocol over InfiniBand that eliminates context switches and data copies between kernel and user space. It utilizes threads to implement asynchronous send requests and internally uses pre-registered send and receive buffers. The communication management is split into two communication channels, a data channel and a control channel. While this mechanism simplifies the data path, it requires extra resources to manage the connections. uStream is complemented by jStream, which provides uStream functionality to Java applications.
UNH Extended Sockets Library The UNH EXS library is an implementation of the Extended Sockets API (ES-API) for RDMA over InfiniBand. The ES-API specification offers a sockets-like interface that defines extensions to the traditional socket API in order to provide asynchronous I/O as well as memory registration for RDMA. UNH EXS offers both stream-oriented and message-oriented sockets. Unlike SDP and VMA, UNH EXS is not designed to run with unmodified sockets applications, which simplifies the design considerably. The library introduces an algorithm that dynamically switches between buffered and zero-copy transfers over RDMA, depending on current conditions. If a send call exceeds the memory window assigned to the receive call, the additional chunks are written to a “hidden” intermediate buffer on the receiver side. To enable true zero-copy, the library implements an advert mechanism that allows direct transfers depending on the receive buffer size.
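The overflow rule for UNH EXS stated above can be expressed as simple arithmetic. The sketch below is not the library's switching algorithm, which depends on runtime conditions not detailed here; the function name `split_send` and its parameters are hypothetical and only model how a send is divided between the advertised receive window and the hidden intermediate buffer.

```python
def split_send(send_len, advertised_window):
    """Split a send into (direct_bytes, buffered_bytes).

    direct_bytes go zero-copy into the advertised receive window;
    buffered_bytes spill into the receiver's hidden intermediate buffer.
    An advertised_window of 0 models the case where no advert is available.
    """
    if send_len < 0 or advertised_window < 0:
        raise ValueError("lengths must be non-negative")
    direct = min(send_len, advertised_window)
    return direct, send_len - direct

fits = split_send(4096, 8192)       # send fits into the window
overflow = split_send(8192, 4096)   # excess goes to the hidden buffer
no_advert = split_send(1024, 0)     # nothing advertised, fully buffered
```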
5.3 Objectives and Strategy
The TCP/IP network stack is the predominant protocol family for network communication, including HPC system networks where system area networks are deployed. As presented in section 5.1, the layered architecture of TCP/IP introduces considerable overhead and performance bottlenecks, especially for interconnection networks that provide reliable connections. Even so, it is desirable to implement TCP/IP communication for an interconnect such as Extoll. Transparent and seamless support of legacy code, the Socket API, and IP address resolution enables a broad range of applications and services to exploit the benefits of Extoll, including its RDMA and transport offloading capabilities, without any code modifications.
5 RDMA-Accelerated TCP/IP Communication
[Figure 5.6 diagram labels: Application, Socket API, Socket Switch, Protocol Switch, TCP, IP, Ethernet over Extoll, Direct Sockets over Extoll, Extoll Kernel API, Extoll Device Driver; User/Kernel boundary; Kernel Bypass Data Transfers]
Figure 5.6: Overview of the Extoll software stack with TCP/IP extensions.
For Extoll, a twofold approach is chosen. Ethernet over Extoll (EXT-Eth) provides a mechanism for emulating Ethernet communication over Extoll by encapsulating Ethernet frames in Extoll network packets. This way, a fully functional TCP/IP implementation is provided, which can leverage the Linux kernel’s support for Ethernet devices while providing IP addressing. The second pillar of the TCP/IP protocol support for Extoll is presented through the specification of the Direct Sockets over Extoll (EXT-DS) protocol for stream sockets, which relies on EXT-Eth for connection establishment and address resolution. The purpose of EXT-DS is to provide a transparent, RDMA-accelerated alternative to the TCP protocol by providing kernel bypass data transfers. Figure 5.6 displays an overview of the Extoll software environment with TCP/IP extensions. Recapitulating the findings from previous sections, the tasks of the software stack can be summarized as follows:
• The design and implementation should be transparent to legacy applications, and should provide IP addressing and Ethernet MAC resolution.
• Both EXT-Eth and EXT-DS should leverage the capabilities of the Extoll NIC as fully as possible. The RDMA functionality should be utilized to maximize the throughput seamlessly. For EXT-Eth, RDMA can be used to maximize the MTU for the link layer. EXT-DS offers a zero-copy data transmission model.
• The software should be able to transparently switch between EXT-Eth and EXT-DS. For example, a user-level protocol switch could transparently exchange the protocols.
• The design of EXT-DS should maintain the socket semantics. Legacy applications should work out of the box without any code modifications or recompilation.
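The switching requirement can be illustrated with a small decision function. This is only a sketch under stated assumptions: the function name `choose_protocol` and the flag `peer_has_ext_ds` are hypothetical, and the rule merely reflects that EXT-DS is specified for stream sockets while EXT-Eth provides the fully functional TCP/IP path.

```python
import socket

def choose_protocol(sock_type, peer_has_ext_ds):
    """Pick the data path for a new socket (hypothetical selection rule).

    EXT-DS targets stream sockets only; everything else, and any peer
    without EXT-DS support (an assumed capability flag), uses EXT-Eth.
    """
    if sock_type == socket.SOCK_STREAM and peer_has_ext_ds:
        return "EXT-DS"
    return "EXT-Eth"

stream_choice = choose_protocol(socket.SOCK_STREAM, peer_has_ext_ds=True)
fallback_choice = choose_protocol(socket.SOCK_STREAM, peer_has_ext_ds=False)
datagram_choice = choose_protocol(socket.SOCK_DGRAM, peer_has_ext_ds=True)
```

Because the decision is taken below the Socket API, the application sees the same socket calls in every case.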
[Figure 5.7 diagram labels: Ethernet frame (64 to 9018+ bytes), consisting of the MAC header (14 bytes: destination MAC address, source MAC address, EtherType), the payload data (9000+ bytes), and the CRC checksum (4 bytes); RMA software descriptor (24 bytes); VELO software descriptor (8 bytes), each followed by an Ethernet frame part (Part I, Part II, Part III) of up to 120 bytes]
Figure 5.7: Ethernet over Extoll software frame format.
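The sizes labeled in Figure 5.7 allow a short worked example of splitting a frame into VELO-sized parts. The sketch assumes, as one reading of the figure, that every part of up to 120 bytes is preceded by one 8-byte VELO software descriptor; the function names are hypothetical and the descriptor contents are not modeled.

```python
VELO_DESCRIPTOR_BYTES = 8   # size labeled in Figure 5.7
VELO_MAX_PART_BYTES = 120   # "up to 120 bytes" per frame part

def velo_parts(frame):
    """Split an Ethernet frame into consecutive parts of at most 120 bytes."""
    return [frame[i:i + VELO_MAX_PART_BYTES]
            for i in range(0, len(frame), VELO_MAX_PART_BYTES)]

def velo_total_bytes(frame):
    """Frame bytes plus one software descriptor per part (assumed layout)."""
    return len(frame) + VELO_DESCRIPTOR_BYTES * len(velo_parts(frame))

frame = bytes(300)                    # dummy 300-byte frame
parts = velo_parts(frame)             # parts of 120, 120 and 60 bytes
total = velo_total_bytes(frame)       # 300 + 3 * 8 = 324 bytes
```

For the 300-byte example, three parts are needed, so the descriptors add 24 bytes in total.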