AF_EXTL: A Prototype Implementation of EXT-DS

5.5 Direct Sockets over Extoll

5.5.4 AF_EXTL: A Prototype Implementation of EXT-DS

This section provides an overview of the design decisions and the current status of the EXT-DS implementation. It first discusses the trade-offs between a user space and a kernel space implementation, and then introduces portability mechanisms that allow legacy applications to use the protocol seamlessly. Afterwards, a brief overview of the implementation status is provided. The section concludes with a description of the de-multiplexing mechanism for incoming messages.

5.5.4.1 Establishing Full Semantics and Process Management

When designing a direct sockets implementation, it is vital that the implementation remains fully functional with respect to the TCP/IP reference model. In general, two implementation paradigms can be distinguished: user space and kernel space. While a user space implementation makes a true operating system bypass possible, it comes with limitations and additional overhead when full socket semantics have to be provided.

Traditional TCP/IP implementations rely on system calls to exchange messages. In general, the operating system is responsible for data transfers and is also able to distinguish data transmission from other API calls. Typically, a socket describes an interface between the user application and the operating system. Starting with its creation through the function call socket(), a socket can pass through multiple different states during its lifetime, as depicted in Figure 5.23. The direct sockets implementation has to keep track of these states. User-level implementations do not have access to the states and would require the software to keep a copy in user space.

Figure 5.23: TCP state transition diagram [157]. The diagram (TCP/IP State Transition Diagram (RFC793), Gordon McKinney, 10 Feb 2004) shows the states CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, and TIME_WAIT, together with the normal transitions for client and server. Reprinted from TCP/IP Illustrated, Volume 2: The Implementation by Gary R. Wright and W. Richard Stevens, Copyright © 1995 by Addison-Wesley Publishing Company, Inc.
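To illustrate what keeping a copy of the connection state in user space would involve, the following is a minimal Python sketch of a state tracker. It covers only the normal client and server transitions of RFC 793; the class name ShadowState and the event labels are hypothetical and do not come from the EXT-DS implementation.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    LISTEN = auto()
    SYN_SENT = auto()
    SYN_RCVD = auto()
    ESTABLISHED = auto()
    FIN_WAIT_1 = auto()
    FIN_WAIT_2 = auto()
    CLOSE_WAIT = auto()
    LAST_ACK = auto()
    TIME_WAIT = auto()


# (current state, event) -> next state. Only the normal client and server
# paths of RFC 793 are listed; the event names are hypothetical labels.
TRANSITIONS = {
    # client side (active open, active close)
    (State.CLOSED, "appl_active_open"): State.SYN_SENT,
    (State.SYN_SENT, "recv_syn_ack"): State.ESTABLISHED,
    (State.ESTABLISHED, "appl_close"): State.FIN_WAIT_1,
    (State.FIN_WAIT_1, "recv_ack"): State.FIN_WAIT_2,
    (State.FIN_WAIT_2, "recv_fin"): State.TIME_WAIT,
    (State.TIME_WAIT, "timeout_2msl"): State.CLOSED,
    # server side (passive open, passive close)
    (State.CLOSED, "appl_passive_open"): State.LISTEN,
    (State.LISTEN, "recv_syn"): State.SYN_RCVD,
    (State.SYN_RCVD, "recv_ack"): State.ESTABLISHED,
    (State.ESTABLISHED, "recv_fin"): State.CLOSE_WAIT,
    (State.CLOSE_WAIT, "appl_close"): State.LAST_ACK,
    (State.LAST_ACK, "recv_ack"): State.CLOSED,
}


class ShadowState:
    """User-space copy of a socket's connection state (illustrative only)."""

    def __init__(self):
        self.state = State.CLOSED

    def on_event(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"event {event!r} is not valid in state {self.state.name}"
            )
        self.state = TRANSITIONS[key]
        return self.state
```

A user-level library would have to update such a copy on every API call and every incoming control message, which is bookkeeping that the kernel already performs for standard sockets.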

Another problem that arises with the user space solution is the handling of unexpected process termination, e.g., through a user-initiated Ctrl-C or a segmentation fault in the application. For a kernel-level implementation, the operating system serves as a central instance of control, while a user-level implementation would need additional functionality to achieve the same behavior.

Single point-to-point connections are the standard case. Function calls such as fork() introduce a new level of complexity: fork() creates a child process by duplicating the parent process. Support for fork() is crucial since many services make heavy use of it. In a user-level implementation, a dispatching instance such as the operating system is outside of the critical path. After a call to fork(), all file descriptors of a process, including its socket descriptors, are shared between the parent and the child process. In user space, this poses a problem for the data receiving path and would require additional control tokens.
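The descriptor sharing after fork() can be observed with a small Python sketch. It uses a local socket pair instead of an EXT-DS connection, and the function name inherited_descriptor_demo is hypothetical. It shows the write side only, because that side is deterministic here.

```python
import os
import socket


def inherited_descriptor_demo():
    """Return what the peer receives when parent and child both write
    to the same socket descriptor after fork()."""
    peer_end, shared_end = socket.socketpair()  # connected stream sockets
    pid = os.fork()
    if pid == 0:
        # Child: shared_end refers to the same open socket as in the parent.
        try:
            shared_end.sendall(b"from child")
        finally:
            os._exit(0)  # leave immediately, skip the parent's cleanup code
    os.waitpid(pid, 0)  # the child has written before the parent continues
    shared_end.sendall(b" and parent")
    expected = len(b"from child and parent")
    data = b""
    while len(data) < expected:
        data += peer_end.recv(64)
    peer_end.close()
    shared_end.close()
    return data
```

If both processes called recv() on the shared descriptor instead, either of them could consume the incoming data. A user-level receive path would therefore need extra coordination between the processes.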

In summary, it is possible to design a user-level implementation, but it requires additional code to handle exceptions and functions such as fork(). To ensure a broad applicability, the implementation of EXT-DS relies on the introduction of a new sockets address family called AF_EXTL and a shared user library, which provides a transparent switching functionality between standard TCP sockets and EXT-DS.

5.5.4.2 Direct Sockets Portability and Adaptability

The Sockets interface is one of the most widely used APIs for network communication. Besides explicitly modifying the source code to use the AF_EXTL address family instead of AF_INET, it is important to provide portability mechanisms that allow legacy applications to seamlessly utilize the direct sockets implementation. By providing a user library that utilizes library interposition, it is possible to intercept Sockets API calls and automatically switch between the different protocols.

Library interposition [158], also known as function interposition, is a powerful linking technique that allows programmers to intercept calls to arbitrary library functions. Linking can be described as the process of collecting and combining various pieces of code and data into a single binary file that can be loaded into memory and executed. Interposition can occur at different times:

• Compile time – when the source code is compiled.

• Link time – when the relocatable object files are statically linked to form an executable object.

• Load/run time – when an executable object file is loaded into memory, dynamically linked, and then executed.

The following discusses possible automatic address family conversion mechanisms through function interposition at link and at run time and introduces the concepts of static and dynamic linking.

Figure 5.24: Overview of the socket software stack. Layers from top to bottom: Application; Socket Switch (AF_INET ↔ AF_EXTL); Socket Layer; TCP, UDP, ICMP, and EXT-DS; Network Layer Protocol (e.g., IP); Low-level NIC Driver.

Automatic Conversion at Link Time

Static applications are statically linked, which means that all library routines are copied into the executable object by the linker. This technique may result in a larger binary file, but is both faster and more portable. For statically linked applications, the LD_PRELOAD environment variable [159] has no effect. An alternative is the wrap command line switch of the GNU linker [160]. When it is specified for a given symbol with --wrap=symbol, a wrapper function is called instead of symbol: any undefined reference to symbol is resolved to __wrap_symbol, while any undefined reference to __real_symbol is resolved to symbol. For an implementation of EXT-DS, this would mean that a wrapper function must be provided, and a --wrap option passed to the linker, for every supported Sockets API call.

Automatic Conversion at Run Time

A much easier approach is offered through shared libraries. When an application is dynamically linked against shared libraries, a list of undefined symbols is included in the application's binary, along with a list of libraries the code is linked with. The -fPIC compile flag generates position-independent code (PIC), which is suitable for use in shared libraries. There is no correspondence between the symbols and the libraries; the two lists just tell the loader which libraries to load and which symbols need to be resolved. At run time, each symbol is resolved using the first library that provides it. For dynamic applications, this means that function interposition can occur at run time, and that applications do not need to be re-compiled against the interposition library. The intercepting library can be preloaded via the LD_PRELOAD environment variable, which contains the path to the pre-loadable library. The environment variable indicates that the user-specified shared object should be prioritized over all other libraries when resolving any symbols.

Algorithm 1 Sockets Switch

procedure SocketSwitch(domain, type, protocol)
    if (domain == AF_INET) && (type == SOCK_STREAM) then
        domain ← AF_EXTL                                  ▷ Forward to EXT-DS
    else if (domain == AF_INET) && (type == SOCK_DGRAM) then
        return __real_socket(AF_INET, type, protocol)
    end if
    fd ← __real_socket(domain, type, protocol)
    if (fd == -1) then
        fd ← __real_socket(AF_INET, type, protocol)       ▷ Fallback to EXT-Eth
    end if
    return fd
end procedure
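The decision logic of Algorithm 1 can be modelled in a few lines of Python. This is a sketch of the control flow only, not the interposition library itself. The numeric value of AF_EXTL is a placeholder chosen for illustration, and real_socket stands in for the original socket() entry point, reporting failure by raising OSError instead of returning -1.

```python
import socket

AF_EXTL = 0x7FFF  # placeholder value; the real constant is not given in the text


def socket_switch(domain, sock_type, protocol, real_socket):
    """Model of the socket switch.

    real_socket(domain, sock_type, protocol) is a hypothetical stand-in
    for the original socket() call.
    """
    if domain == socket.AF_INET and sock_type == socket.SOCK_STREAM:
        try:
            # Forward TCP stream sockets to EXT-DS.
            return real_socket(AF_EXTL, sock_type, protocol)
        except OSError:
            # EXT-DS is unavailable: fall back to a standard TCP socket.
            return real_socket(socket.AF_INET, sock_type, protocol)
    # Datagram sockets and all other families pass through unchanged.
    return real_socket(domain, sock_type, protocol)
```

In this model, only AF_INET stream sockets are redirected, so UDP traffic and other address families keep their standard path.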

5.5.4.3 Overview and Status of the AF_EXTL Module

Within the scope of this work, EXT-DS is implemented as a kernel module and registered with the kernel as a new transport layer protocol family called AF_EXTL (illustrated in Figure 5.24). It can be used to create a communication endpoint for sockets of type SOCK_STREAM, and is built on top of the Extoll software environment, especially the kernel API. In addition, a socket switch is provided through a pre-loadable, user-level shared library. The library seamlessly intercepts the endpoint creation call socket() and determines whether a standard TCP (AF_INET) or an EXT-DS (AF_EXTL) socket should be created, as shown in Algorithm 1.

The shadow socket mechanism is implemented by establishing the connection through the EXN interface, which means that standard TCP protocol function pointers are used for connection establishment. The dashed line in Figure 5.24 indicates this path. The current implementation of the EXT-DS module only supports the BCopy mode, which preserves socket semantics for legacy applications, and utilizes 16 VELO and 16 RMA VPIDs for the virtual device management. Since the entire implementation is in kernel space, it does not leverage Extoll’s kernel bypass capabilities, but the transport is offloaded to the NIC while establishing full semantics. It implements all socket entry points that can be invoked by the kernel, including bind(), release(), and connect().

The EXT-DS implementation takes a resource-conservative approach and only allocates and maps AF_EXTL objects when an application requests a connection. In case of an error or insufficient resources, the module provides a fallback mechanism through the EXN module. The de-multiplexing of incoming RMA and VELO packets to the ports is handled through a kernel FIFO structure, which is presented in the next section. At the time of this writing, only the VELO path is fully functional and used as a proof of concept.

Figure 5.25: Overview of the Kernel FIFO usage with different message types. Each FIFO entry carries a tag (VELO, RMA_INFO, or RMA_RDY) and a size; VELO entries also carry data, while RMA_INFO and RMA_RDY entries carry none. The markers (1) to (4) label the scenarios discussed in Section 5.5.4.4.

5.5.4.4 De-Multiplexing of Incoming Messages

Each virtual device relates to one RMA VPID and one VELO VPID, and is shared among multiple port numbers. VELO and RMA notifications arrive at a virtual device and need to be forwarded to the right port structure. The de-multiplexing of incoming messages is handled by a kernel thread, which implements a progress function that snoops on the VELO and RMA mailboxes of a given VPID. In order to avoid an interrupt-driven mechanism, the progress function is called either whenever a socket system call is triggered by the user or when the module-internal timer expires. To establish flow control for the RMA buffer resources, VELO messages are utilized and can be distinguished by their user tag, as described in Section 5.5.3.1. The progress function retrieves the port number from the payload of a VELO message and then enqueues the message in the corresponding kernel FIFO. Figure 5.25 displays different scenarios:

• When the kernel thread is running, incoming VELO messages are enqueued in the kernel FIFO. A message either (1) has its payload attached, or (2) states that the next chunk of the payload is to be received through an RMA PUT operation (RMA_INFO).

• When calling the receive function, the user can only read messages up to the RMA_INFO entry (3). Then, the user blocks until new data is available.

• When an RMA PUT completes, a notification is written to the corresponding mailbox. The progress function retrieves the notification, matches it with the corresponding message in the kernel FIFO, and sets the user tag to RMA_RDY (4), which indicates that the data can be read.
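The scenarios above can be modelled with a small Python sketch of one FIFO per port. All class and method names are hypothetical. Two simplifications are assumptions rather than documented behavior: an RMA completion is matched to the oldest pending RMA_INFO entry, and receive() returns the readable entries instead of blocking.

```python
from collections import deque
from dataclasses import dataclass

VELO, RMA_INFO, RMA_RDY = "VELO", "RMA_INFO", "RMA_RDY"


@dataclass
class Entry:
    tag: str
    size: int
    data: bytes = b""  # empty for RMA entries: the payload arrives via RMA PUT


class PortFifo:
    """Model of the kernel FIFO of a single port."""

    def __init__(self):
        self.entries = deque()

    def enqueue(self, tag, size, data=b""):
        self.entries.append(Entry(tag, size, data))

    def rma_completed(self):
        """Mark the oldest pending RMA_INFO entry as readable (assumption)."""
        for entry in self.entries:
            if entry.tag == RMA_INFO:
                entry.tag = RMA_RDY
                return True
        return False

    def receive(self):
        """Dequeue all readable entries, stopping at the first RMA_INFO."""
        readable = []
        while self.entries and self.entries[0].tag != RMA_INFO:
            readable.append(self.entries.popleft())
        return readable


class Demux:
    """Model of the progress function: routes messages to per-port FIFOs."""

    def __init__(self):
        self.ports = {}

    def progress(self, port, tag, size, data=b""):
        self.ports.setdefault(port, PortFifo()).enqueue(tag, size, data)
```

For example, after enqueuing a VELO entry, an RMA_INFO entry, and another VELO entry for one port, receive() returns only the first VELO entry. Once rma_completed() has been called, the next receive() returns the RMA_RDY entry followed by the second VELO entry, which preserves the arrival order.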

In document Accelerating Network Communication and I/O in Scientific High Performance Computing Environments (Page 155-161)