TCP RAPID adopts a cross-layer design: both the sender and the receiver implementations consist of a congestion-control module, which uses the tcp_congestion_ops interface in the TCP layer, and two link-layer Qdisc modules. In this section we describe three additional challenges in designing an implementation of TCP Rapid as dynamically loadable kernel modules. These challenges are independent of the challenges in realizing "gap-clocking" and microsecond timestamping because they deal with conditions specific to implementing functions used by TCP Rapid inside the kernel [90].
Fast Per-frame Execution
One goal of the implementation design was to keep CPU utilization as close as possible to that of widely used TCP variants such as Cubic. Some additional utilization is unavoidable in implementing functions such as BASS de-noising and computing available bandwidth. However, these functions are executed once per probe stream, so their CPU cost is amortized over a number of TCP segments (64 or more). It is equally important to control the CPU utilization required to process each individual segment. The data structures and algorithms that need to run for each segment sent or received were therefore carefully chosen to minimize per-frame execution path lengths. Some examples are: the heap used for scheduling frame transmissions, the per-TCP-connection send queues, and the circular table holding p-stream frame times.
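As an illustration of why a heap keeps the per-frame path short, the following is a minimal user-space sketch of a binary min-heap keyed by intended transmission time. It is not the Rapid code: the names (tx_heap, frame_ref, heap_push, heap_pop), the fixed capacity, and the choice of a binary heap are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_CAP 1024u              /* hypothetical fixed capacity */

struct frame_ref {
    uint64_t send_ns;               /* intended transmission time (ns) */
    uint32_t frame_id;              /* hypothetical handle to the queued frame */
};

struct tx_heap {
    struct frame_ref item[HEAP_CAP];
    size_t len;
};

static void swap_items(struct frame_ref *a, struct frame_ref *b)
{
    struct frame_ref tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Insert a frame; O(log n) sift-up. Returns false if the heap is full. */
static bool heap_push(struct tx_heap *h, struct frame_ref f)
{
    if (h->len == HEAP_CAP)
        return false;
    size_t i = h->len++;
    h->item[i] = f;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->item[parent].send_ns <= h->item[i].send_ns)
            break;
        swap_items(&h->item[parent], &h->item[i]);
        i = parent;
    }
    return true;
}

/* Remove the frame with the earliest send time; O(log n) sift-down. */
static bool heap_pop(struct tx_heap *h, struct frame_ref *out)
{
    if (h->len == 0)
        return false;
    *out = h->item[0];
    h->item[0] = h->item[--h->len];
    size_t i = 0;
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->len && h->item[l].send_ns < h->item[m].send_ns)
            m = l;
        if (r < h->len && h->item[r].send_ns < h->item[m].send_ns)
            m = r;
        if (m == i)
            break;
        swap_items(&h->item[i], &h->item[m]);
        i = m;
    }
    return true;
}
```

With this layout, finding the next frame to send is a constant-time read of item[0], and each insert or removal touches at most one root-to-leaf path, so the per-frame cost grows only logarithmically with the number of queued frames.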
No Floating-point Hardware
Using the system floating-point hardware is permitted in kernel code, but its use is strongly discouraged and is very rare, confined to special heavy-computation cases such as checksums and cryptography. The reason is that the kernel does not save floating-point context when entered by system calls from user space where floating point may be in use. If a kernel module were to use the floating-point hardware, it would have to save and restore the registers and other state. The overhead of handling such a large amount of context usually negates any gain from using the hardware unless there is a large amount of computation to be done. While TCP Rapid could conveniently use floating-point in some cases, the overheads make its use undesirable.
Instead, TCP Rapid uses several strategies to handle these computations. In many cases it is more efficient to use 64-bit integers for precision (e.g., nanosecond counters) and integer operations with appropriate scaling of the operands. In other cases, single-variable functions can be pre-computed and the results stored in an array, with integer linear interpolation used to find values between two stored results. Another approach is to approximate a non-integer value as a ratio of two integers whose denominator is a power of two (e.g., 256), so that the division reduces to an arithmetic shift. For example, 1000/1.15 (869.565) is adequately approximated for use in Rapid by (1000 * 223)>>8 (871).
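The two integer-only strategies can be sketched in plain C as follows. The first function reproduces the 223/256 example from the text. The lookup table, its step size, and its contents are hypothetical placeholders chosen only to show integer linear interpolation. None of the function names are taken from the Rapid source.

```c
#include <assert.h>
#include <stdint.h>

/* 1/1.15 is approximated as 223/256; 223 = round(256 / 1.15). */
#define RATIO_SHIFT  8
#define INV_1_15_Q8  223u

/* Divide x by 1.15 using one multiply and one shift. */
static inline uint64_t div_by_1_15(uint64_t x)
{
    assert(x <= UINT64_MAX / INV_1_15_Q8);   /* guard against overflow */
    return (x * INV_1_15_Q8) >> RATIO_SHIFT;
}

/*
 * Hypothetical pre-computed function: f(x) sampled every TABLE_STEP
 * units of x. The values below are placeholders, not Rapid data.
 */
#define TABLE_STEP 16u
#define TABLE_LEN  5u
static const uint32_t f_table[TABLE_LEN] = { 1000, 1200, 1500, 1900, 2400 };

/* Evaluate f(x) by integer linear interpolation between two samples. */
static inline uint32_t table_lookup(uint32_t x)
{
    uint32_t idx  = x / TABLE_STEP;
    uint32_t frac = x % TABLE_STEP;

    if (idx >= TABLE_LEN - 1)                /* clamp beyond the last sample */
        return f_table[TABLE_LEN - 1];

    /* Signed difference so that decreasing tables also work. */
    int64_t lo    = f_table[idx];
    int64_t delta = (int64_t)f_table[idx + 1] - lo;
    return (uint32_t)(lo + (delta * (int64_t)frac) / (int64_t)TABLE_STEP);
}
```

Choosing a power-of-two denominator trades a small approximation error (871 instead of 869.565 in the example above) for a division-free computation. A larger shift gives a closer ratio at the cost of less headroom before the multiply overflows.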
Implementation for Multi-core
Implementing kernel functions on multicore systems requires careful attention to protecting critical sections in the code from concurrent execution on independent CPUs. In the TCP Rapid implementation, this issue is made more critical by the data structures shared between the TCP module and the Qdisc modules. As mentioned in Section 7.1, the TCP module and the Qdisc modules share a circular table which records, for each packet, its index in the p-stream, its intended dequeue time, and the time at which it is acknowledged. The table is accessible from the socket private data structure inet_ca_priv. The egress Qdisc module adds entries to the table upon packet enqueuing; the ingress Qdisc module searches the table for the packet corresponding to the latest ACK and updates its acknowledgement time; the congestion-control module checks whether an entire p-stream has been acknowledged and computes inter-packet arrival gaps for bandwidth estimation for that p-stream. On multicore systems, these modules can execute concurrently, and any of them may run on a code path initiated by a system call from user space or by an interrupt in kernel context.
To ensure that at any moment only one module accesses the circular table, the critical sections that touch the shared data structures are protected by kernel read/write spin-locks that also disable interrupt "bottom half" (softirq and tasklet) processing (e.g., write_lock_bh()). Obtaining the lock is required to access inet_ca_priv from any module. Twenty-one lock/unlock pairs are used to synchronize access to shared data. Primitive data types (e.g., integers) are protected by declaring them as "atomic" types and using kernel-guaranteed atomic operations (e.g., atomic_set()).
TCP Offloading
TCP offloading is a technique that moves some TCP/IP stack processing functions onto the NIC. The most important offloading functions are TCP segmentation offload (TSO) and large receive offload (LRO). For outbound traffic, TSO allows the CPU to hand large chunks of data to the NIC, which then divides them into MTU-sized packets and encapsulates them. For inbound traffic, LRO merges multiple packets from the same stream into a larger packet to reduce the number of NIC interrupts. Such offloading is widely used with gigabit NICs to free CPU cycles from high-rate packet processing. However, it deprives the operating system of the ability to manipulate traffic at per-packet granularity, which the RAPID implementation needs for gap creation on both the sender and the receiver side. On the sender side, the egress scheduler controls the transmission of each individual packet before it is handed to the NIC, and the ingress scheduler has to replace the timestamps in each ACK packet. Similarly, on the receiver side, the ingress scheduler must precisely record the arrival time of each incoming packet after it is handed from the NIC to the operating system, and the egress scheduler writes µs-resolution timestamps in the header of each returning ACK. As a result, TSO and LRO are turned off on both the sender and the receiver to make RAPID function properly. To observe the effect of turning off offloading, we run an iperf flow transmitting packets at 10 Gbps and monitor the CPU utilization on both the sender and the receiver. Without TCP offloading, the CPU utilization increases from 21.6% to 30.6% at the sender, and from 34.2% to 45.2% at the receiver.