EXN: Extoll Network Interface - Transmission of Ethernet Frames over Extoll

5.4 Transmission of Ethernet Frames over Extoll

5.4.6 EXN: Extoll Network Interface

The network subsystem of the Linux kernel is designed to be completely protocol-independent, which applies to both networking and hardware protocols. The interaction between a network interface driver and the kernel deals with one packet at a time and allows protocol issues to be hidden neatly from the driver, but also hides the physical transmission from the protocol. This section introduces the EXN module, which has been implemented as part of this work. It provides the network interface between the Extoll hardware and the TCP/IP stack.

5.4.6.1 Network Driver Overview

The Extoll Network (EXN) interface belongs to the Ethernet class and implements the EXT-Eth protocol as a loadable kernel module. Emulating Ethernet has the benefit that the implementation can take full advantage of the kernel’s generalized support for Ethernet devices. The most important tasks performed by a network interface are data transmission and reception. Whenever the kernel needs to transmit a data packet, it calls the hard_start_xmit() method of the driver, which puts the data in the outgoing queue. EXN implements the aforementioned eager and rendezvous protocols for data transmission, and uses a reserved RMA and VELO VPID for process security. For packet reception, EXN supports both the interrupt-driven and the NAPI mode.

    exn0: flags=67<UP,BROADCAST,RUNNING>  mtu 65536
            inet 10.2.0.8  netmask 255.255.255.0  broadcast 10.2.0.255
            inet6 fe80::2c87:4ff:fe07:1  prefixlen 64  scopeid 0x20<link>
            ether 2e:87:04:07:00:01  txqueuelen 1000  (Ethernet)
            RX packets 22  bytes 1572 (1.5 KiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 22  bytes 1572 (1.5 KiB)
            TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Listing 5.1: Example output of ifconfig for Extoll network interface exn0.

Figure 5.15 illustrates the path of an incoming packet for EXN running in NAPI mode. Depending on the transmission protocol, data is received either through the VELO (1a) or RMA (1b) units. VELO writes the incoming packet to the next free message slot in a receive ring buffer associated with the used VPID. RMA, on the other hand, writes the packet directly to a pre-allocated socket buffer. Upon completion, the Extoll NIC triggers a hardware interrupt (2). Extoll can distinguish different sources of interrupts, e.g., different functional units. Therefore, EXN has two different interrupt handlers, one for VELO and one for RMA interrupts. As the processing in the interrupt context should be kept as short as possible, netif_rx_schedule() puts a reference to the Extoll device into the poll queue (3), which moves the packet processing into the software interrupt context. Then, net_rx_action() peeks at the first entry of the poll list (4). If there are packets available for reception, the function disables all interrupts and calls the poll() method of the driver. In the case of EXN, there are two different poll() methods registered, one to process VELO interrupts (5a) and one for RMA (5b). Depending on the message tag, a VELO message can either carry the payload of a packet or advertise an RMA GET operation. If it contains a payload fragment, the data is copied to a free socket buffer entry in the ring buffer and, once the complete payload has been received, passed to the upper layers (6). If it advertises an RMA operation, the poll() method writes the RMA software descriptor and initiates the GET operation. In case of an RMA interrupt (5b), the packet has already been received in a pre-allocated socket buffer and can be passed to the upper layers (6) for further processing.
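The tag-based dispatch in the VELO poll() method (step 5a) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the EXN kernel code: the tag values, the `last_fragment` flag used to detect a complete payload, and all class and field names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tag values; the real EXT-Eth tag encoding is not given here.
TAG_PAYLOAD = 0      # message carries a fragment of the packet payload
TAG_RMA_ADVERT = 1   # message advertises an RMA GET operation


@dataclass
class VeloMessage:
    tag: int
    data: bytes = b""
    last_fragment: bool = False   # assumption: marks the final payload fragment
    remote_addr: int = 0          # assumption: source address for the GET
    length: int = 0               # assumption: number of bytes to fetch


@dataclass
class VeloReceiver:
    partial: bytearray = field(default_factory=bytearray)
    delivered: list = field(default_factory=list)  # packets passed to upper layers
    rma_gets: list = field(default_factory=list)   # (remote_addr, length) GETs issued

    def poll(self, ring):
        """Process all messages currently in the receive ring."""
        for msg in ring:
            if msg.tag == TAG_PAYLOAD:
                # Copy the fragment into the socket buffer being assembled.
                self.partial += msg.data
                if msg.last_fragment:
                    # Complete payload: hand the packet to the upper layers.
                    self.delivered.append(bytes(self.partial))
                    self.partial = bytearray()
            elif msg.tag == TAG_RMA_ADVERT:
                # Stand-in for writing the RMA descriptor and starting the GET.
                self.rma_gets.append((msg.remote_addr, msg.length))
            else:
                raise ValueError(f"unknown VELO message tag: {msg.tag}")
```

In this model, two payload fragments followed by an advertisement yield one delivered packet and one pending GET; the RMA completion path (5b) is not modeled.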

Listing 5.1 displays the command line output of the ifconfig tool for the EXN interface. It can be seen that EXN utilizes the described MAC address format to encode the node ID information. IP addresses can be assigned either statically or dynamically. For static IP addresses, the interface can be configured with a preassigned IP address through an interface configuration file. Otherwise, the Dynamic Host Configuration Protocol (DHCP) can be utilized for assigning IP addresses. By default,


2) The ixgbe_clean_rx_irq() phase starts with the recycling of Rx descriptors, which is done before they are returned to the hardware (ixgbe_alloc_rx_buffers()).

3) The packet data is fetched from the Rx ring: An Rx descriptor is read from the ring and a socket buffer structure (SKB) is created that points to the respective buffer (ixgbe_fetch_rx_buffer()).

4) After several sanity checks, the processing of the SKB is initiated (netif_receive_skb()).

5) The Ethertype determines how the SKB is processed. With Open vSwitch, the actual packet processing for IP packets is defined by ovs_vport_receive(). Open vSwitch determines the outgoing interface and the output queue. At this point, the packet transmission based on NAPI and ixgbe starts.

1’) At the end of the processing, the SKB containing the packet is scheduled for transmission (sch_direct_xmit()).

2’) A Tx descriptor is prepared in the Tx ring (ixgbe_xmit_frame_ring()). If more packets are available on the Rx ring and if the poll size is not reached, the algorithm continues with step 2.

3’) In case the Tx and Rx rings were cleaned, the respective IRQ is re-enabled. If the dynamic interrupt throttling rate (ITR) is enabled, the ITR is recalculated to reprogram the NIC. Then, the poll returns to the NAPI (cf. section III-A, step 7).

C. Interrupt Throttling Rate

NAPI-based packet processing can be configured by several parameters. With the ixgbe driver, one of the most important parameters is the ITR. The ITR defines an upper bound of IRQs per second for a set of Tx and Rx rings. The ITR relies on an ITR timer which is set to $1/\mathrm{ITR}$ after an IRQ was asserted. Until the ITR timer has expired, no further IRQs can be generated. If packet transmission or reception happened before the ITR timer expired, the IRQ is fired on timer expiration. Otherwise, the next reception or transmission event immediately causes an IRQ. The ITR can be configured as static, dynamic, or disabled.
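The timer rule can be made concrete with a small model. The sketch below is illustrative only and is not taken from the ixgbe driver: the function names are hypothetical, the ITR is assumed to be given in thousand interrupts per second, and timestamps are in microseconds.

```python
def min_irq_gap_us(itr_kips):
    """Minimum spacing between two IRQs in microseconds (the timer value 1/ITR)."""
    if itr_kips <= 0:
        raise ValueError("ITR must be positive")
    # 1 / (itr_kips * 1000 IRQs per second) = 1000 / itr_kips microseconds
    return 1000.0 / itr_kips


def irq_times(event_times_us, itr_kips):
    """Return the times at which IRQs fire for the given Rx/Tx event times."""
    gap = min_irq_gap_us(itr_kips)
    irqs = []
    timer_expiry = float("-inf")   # no timer running before the first IRQ
    for t in sorted(event_times_us):
        if irqs and t <= irqs[-1]:
            # The event happened before an IRQ that is already scheduled
            # for timer expiration; that IRQ covers it.
            continue
        # Fire immediately if the timer has expired, else on expiration.
        fire = max(t, timer_expiry)
        irqs.append(fire)
        timer_expiry = fire + gap
    return irqs
```

With an ITR of 20 thousand interrupts per second the gap is 50 µs, so events at 0, 10, 20, and 200 µs produce IRQs at 0, 50, and 200 µs in this model: the events at 10 and 20 µs share the deferred IRQ at timer expiration.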

Disabling the ITR results in short packet latencies but has a negative impact on the maximum throughput, especially in high traffic load situations, where the CPU is often occupied with IRQ handling. Using a static ITR is suitable for manually setting the upper bound of IRQs per second which is then independent of the offered load. The increase of the ITR lowers the latency but increases the CPU utilization and may lower the maximum throughput. Hence, the appropriate configuration of the ITR is a trade-off between latency and maximum throughput.

With a dynamic ITR, the ITR is adapted according to the current traffic load. When a poll finishes, a new ITR is calculated. The three ITR states lowest, low (initial state), and bulk are defined, where each ITR state is associated with a specific ITR value in thousand interrupts per second (kips) as depicted in Fig. 3.

[Fig. 3: state diagram. States: lowest (100 kips), low (20 kips), bulk (8 kips). Transition labels: ≥10 MB/s, ≥20 MB/s, <20 MB/s, <10 MB/s.]

Fig. 3. Interrupt throttling rate states of the ixgbe NIC driver

The current ITR state $s$ and the throughput determine the transition to a new ITR state $s'$. The new ITR $r'$ is calculated on the basis of the current ITR $r$ and the ITR value of the new ITR state $s'$ according to Eq. (1).

$$ r' = \frac{10 \cdot s' \cdot r}{9 \cdot s' + r} \tag{1} $$

For instance, if the current offered load is low, and thus the throughput is low, then the ITR becomes high and vice versa.
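The state machine and the update in Eq. (1) can be sketched as follows. The ITR values per state come from Fig. 3; the assignment of the four throughput thresholds to individual transitions is an assumption (higher throughput moves toward bulk), and the function names are hypothetical rather than taken from the ixgbe driver.

```python
# ITR value per state in thousand interrupts per second (kips), from Fig. 3.
ITR_KIPS = {"lowest": 100, "low": 20, "bulk": 8}


def next_state(state, throughput_mb_s):
    """Assumed transition rule: thresholds at 10 MB/s and 20 MB/s."""
    if state == "lowest":
        return "low" if throughput_mb_s >= 10 else "lowest"
    if state == "low":
        if throughput_mb_s >= 20:
            return "bulk"
        if throughput_mb_s < 10:
            return "lowest"
        return "low"
    if state == "bulk":
        return "low" if throughput_mb_s < 20 else "bulk"
    raise ValueError(f"unknown ITR state: {state}")


def updated_itr(r, s_new):
    """Eq. (1): new ITR from the current ITR r and the target ITR s_new of the new state."""
    return (10 * s_new * r) / (9 * s_new + r)


def step(state, r, throughput_mb_s):
    """One recalculation after a poll: returns (new_state, new_itr)."""
    new_state = next_state(state, throughput_mb_s)
    return new_state, updated_itr(r, ITR_KIPS[new_state])
```

Eq. (1) can be rewritten as $1/r' = 0.9 \cdot (1/r) + 0.1 \cdot (1/s')$, so each update moves the interrupt interval only one tenth of the way toward the interval of the new state. Starting from r = 20 kips with target s' = 8 kips, for example, one update yields 1600/92 ≈ 17.4 kips rather than jumping directly to 8 kips.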

IV. LATENCY EVALUATION WITH MEASUREMENTS

For the measurement of the NAPI performance a network stack is required that utilizes the NAPI. This network stack must not introduce any unpredictable effects into the measured data to avoid corruption of our packet reception and transmission measurements. In the best case it only utilizes a constant additional share of the CPU and adds a constant additional latency per packet to the measurements. Therefore, we decided to use Open vSwitch [25]–[27] as a representative NAPI-based in-kernel packet forwarding application. Open vSwitch is part of Linux and is able to operate in layer 2 of the ISO OSI stack but also in higher layers. Previously we have shown that Open vSwitch has a predictable average per packet processing cost in terms of CPU cycles [16].

A. Measurement Setup

Our test setup is based on recommendations by RFC 2544 [28]. The device under test (DuT) is connected to a device which runs a load generator and a packet counter in order to measure the achieved throughput. For profiling of software the DuT runs the Linux tool perf to gather statistics like the interrupt rate. Profiling measurements were run for five minutes per test to get accurate results. Our tests indicate that running this utility on the DuT introduces an overhead that reduces the maximum throughput by 1 %.

The DuT uses an Intel X540-T2 dual 10 GbE NIC and is equipped with a 3.3 GHz Intel Xeon E3-1230 V2 CPU. We disabled Hyper-Threading, Turbo Boost, and power saving features that scale the frequency with the CPU load because we observed measurement artefacts with these features.

The DuT runs the Debian-based live Linux distribution Grml with a 3.7 kernel and the ixgbe 3.14.5 NIC driver, with interrupts statically assigned to CPU cores. Open vSwitch is used in version 2.0.0 with manually created OpenFlow rules to match the traffic.

B. Load Generation

Our load generator is based on the high-performance packet processing framework DPDK [8]. This packet generator can reliably generate constant bit rate (CBR) traffic by utilizing rate control hardware features of an X540-based NIC [20].

Figure 5.16: Interrupt throttling rate state transitions of the ixgbe driver [126].

5.4.6.2 Protocol Thresholds for Efficient Communication

As described in section 5.4.1, EXT-Eth relies on two communication protocols for data transmission at the link layer. To switch seamlessly and efficiently between the two protocols, the EXN module implements two different protocol thresholds.

Eager/Rendezvous Protocol Switch. The EXN module internally switches between the eager and rendezvous protocols depending on the size of the payload. For small payloads, the eager protocol provides a low-latency path that transmits packets through the VELO unit. For large payloads, the rendezvous protocol is used. The initial “handshake” is initiated by a VELO message containing the information needed to set up the GET sink on the target node. Based on Extoll micro-benchmark results, the threshold for the switch between VELO and RMA data transmission should be between 120 and 480 bytes, which translates to a maximum of four VELO packets for the eager protocol. This way, the module provides a good trade-off between latency and bandwidth.
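The size-based protocol switch can be sketched as a simple selection function. This is a minimal illustration, not the EXN implementation: the per-packet VELO payload of 120 bytes is inferred from "480 bytes correspond to four VELO packets", the inclusive comparison at the threshold is an assumption, and the names are hypothetical.

```python
import math

# Assumption: payload bytes carried by one VELO packet (480 bytes / 4 packets).
VELO_PAYLOAD_BYTES = 120


def select_protocol(payload_len, threshold=480):
    """Return (protocol, number of VELO messages) for a payload of the given size.

    threshold is the eager/rendezvous switch point in bytes; the text suggests
    a value between 120 and 480.
    """
    if payload_len < 0:
        raise ValueError("payload length must be non-negative")
    if payload_len <= threshold:
        # Eager: the payload itself travels in one or more VELO packets.
        packets = max(1, math.ceil(payload_len / VELO_PAYLOAD_BYTES))
        return "eager", packets
    # Rendezvous: a single VELO message advertises the RMA GET;
    # the payload is then fetched through the RMA unit.
    return "rendezvous", 1
```

With the upper threshold of 480 bytes, a 100-byte payload is sent eagerly in one VELO packet, a 480-byte payload in four, and a 1500-byte payload triggers the rendezvous handshake.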

NAPI Budget. The Linux kernel uses the interrupt-driven mode by default and only switches to polling mode when the flow of incoming packets exceeds a certain threshold, known as the weight of the network interface. For NAPI-compliant network drivers, the budget module parameter or interrupt throttling rate (ITR) places a limit on the amount of work the driver may do, e.g., interrupts per second. Each received packet counts as one unit of work. The return value of the poll() function is the number of packets that were actually processed. If, and only if, the return value is less than the budget, a NAPI driver re-enables interrupts and turns off polling. For the EXN module, there are two NAPI budgets, one for processing incoming VELO packets and one for processing RMA notifications. For both, the default budget value is 64 packets, but it is implemented as a configurable module parameter.

Dynamic ITR. As previously described, NAPI-based packet processing can be configured by the interrupt throttling rate. For the EXN module, this is a static value that can only be changed at module startup time. In order to automatically adapt the ITR to the current traffic load, Intel’s 10 GbE driver ixgbe proposes a dynamic ITR [126]. When a poll finishes, the ITR is recalculated. There are three ITR states: lowest, low (initial state), and bulk. Each ITR state is associated with a specific ITR value in thousand interrupts per second (kips) as depicted in Figure 5.16. The current ITR state and the throughput determine the transition to a new ITR state. For instance, if the current load is low, and thus the throughput is low, the ITR becomes high and vice versa. Future versions of EXN will adopt a similar mechanism.
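The NAPI budget rule described above (poll until a call processes fewer packets than the budget, then re-enable interrupts) can be modeled in a few lines. This is a user-space sketch with hypothetical function names, not kernel code; the queue stands in for either the VELO ring or the RMA notification queue, and the default budget of 64 is the value stated in the text.

```python
from collections import deque


def napi_poll(rx_queue, budget=64):
    """Process at most `budget` packets; return how many were processed."""
    done = 0
    while done < budget and rx_queue:
        rx_queue.popleft()   # stand-in for handing one packet to the upper layers
        done += 1
    return done


def service_until_idle(rx_queue, budget=64):
    """Poll repeatedly and return the work done per poll call.

    Polling stops, and interrupts would be re-enabled, only when a poll
    returns less than the budget.
    """
    work_per_poll = []
    while True:
        done = napi_poll(rx_queue, budget)
        work_per_poll.append(done)
        if done < budget:
            break   # queue drained: leave polling mode, re-enable interrupts
    return work_per_poll
```

For a backlog of 150 packets and the default budget, the model polls three times and processes 64, 64, and 22 packets. A backlog of exactly 64 packets needs a second poll that returns 0, because a return value equal to the budget does not allow the driver to leave polling mode.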

In document Accelerating Network Communication and I/O in Scientific High Performance Computing Environments (Page 144-147)