Network Performance in the Linux Kernel

(1)

Fosdem, February 2021

Network

Performance in

the Linux Kernel

Maxime Chevallier

[email protected]

Corrections, suggestions, contributions and translations are welcome!

embedded Linux and kernel engineering

(2)

Maxime Chevallier

I Linux kernel engineer at Bootlin.

I Linux kernel and driver development, system integration, boot time optimization, consulting. . .

I Embedded Linux, Linux driver development, Yocto Project & OpenEmbedded and Buildroot training, with materials freely available under a Creative Commons license.

I https://bootlin.com

I Contributions:

I Worked on network (MAC, PHY, switch) engines.

I Contributed to the Marvell EBU SoCs upstream support.

I Worked on Rockchip’s Camera interface and Techwell’s

TW9900 decoder.

(3)

Preamble - goals

I Follow the path of packets through the Hardware and Software stack

I Understand the features of modern NICs

I Discover what the Linux Kernel implements to gain performances

I Go through the various offloadings available

(4)

Fosdem, February 2021

The path of a

packet

(5)

The Hardware

Link Partner CPU MAC RAM PHY

I Link Partner : The other side of the cable

I Connector : 8P8C (RJ45), SFP, etc.

I Media : Copper, Fiber, Radio

I PHY : Converts media-dependent signals into standard data

(6)

The NIC

I Network InterfaceController

I Sometimes embed a PHY (PCIe networking card)

I The MAC : Handles L2 protocol, transfers data to the CPU

(7)

Frame reception

011011001 1101 Buffer

MAC RAM

Receive queue

CPU

I The MACreceived data and writes it to RAM usingDMA

I A descriptor is created

I It’s address is put in a queue

(8)

Frame reception - IRQ

Buffer MAC RAM CPU IRQ I An interrupt is fired

(9)

Frame reception - Unqueue

011011001 1101 Buffer MAC RAM CPU Buffer 0111010011101001110101

I The Interrupt handler acknowledges the interrupt

I The packet is processed insoftirq context

I The next frame can be received in parallel

(10)

In the NIC driver

I The CPU : Processes L3 (packets) and above, up to the application

I The Interrupt Handler does very basic work, and masks interrupts

I NAPI is used to schedule the processing in batches

I Subsequent frames are also dequeued

I NAPI stops dequeueing once :

I The budget is expired (release the CPU to the scheduler)

I The queue is empty

I NAPI re-enables interrupts

I This avoids having one interrupt per frame

(11)

In the kernel networking stack

I The PCAP hook is called, then the TC hook

I The header in unpacked, to decide if :

I The packet is forwarded

I The packet is dropped

I The packed is passed to a socket

I The in-kernel data path is heavily optimized...

I ...But still requires some processing power at very high speeds

(12)

Fosdem, February 2021

Traffic Spreading

and Steering

(13)

Scaling across queues and CPUs

I Most modern systems have multi-core CPUs

I Modern NICs have multiple RX/TX queues (rxq/txq)

I Interrupted CPU does all the packet processing

I If the interrupt always goes to the same core...

I ...the other ones will stay unused

Core 0 Core 1

Core 2 Core 3

MAC

Hardware and Software techniques exists to scale processing across CPUs

(14)

Scaling

Goal : Spread packet across CPU cores

I We can’t randomly assign packets to CPUs

I Ordering must be preserved

I Memory domains should be taken into account (L1/L2 caches, NUMA nodes)

I We need to spread packets per-flow

Kernel documentation : Documentation/networking/scaling.rst

(15)

N-tuple Flows

Flow: Packets from the same emitter, for the same consumer

I Flows are identified by data extracted from the headers

I L3 flow : Source and Destination IP addresses→ 2-tuple

I L4 flow : src/dst IP + Proto + src/dst ports → 5-tuple

MAC SA MAC DA ethtype vlan

ver ToS len id checksum

SRC port DST port len, chksum, etc.

IP DA Payload Proto IP SA L2 L3 L4 2-tuple MAC SA MAC DA ethtype vlan

len, chksum, etc. IP DA DST port SRC port Payload Proto IP SA L2 L3 L4 5-tuple

(16)

N-tuple Flows

I Other flows can be interesting :

I Vlan-based flows

I Destination MAC-based flows

I Most often, these tuples are hashed prior to being used

I Allows to build smaller flow tables

I Advanced NICs are able to keep track of a high number of flows

MAC SA MAC DA ethtype vlan

len, chksum, etc. IP DA DST port SRC port Proto IP SA L2 L3 L4 proto IP SA ... Hash(key) % N key = = cpu_id

(17)

RPS : Receive Packet Steering

Core 0 Core 1

Core 2 Core 3 MAC

I The interrupted CPU schedules processing on other CPUs

I Key data is extracted from the Headers and hashed

I The Hash can be computed by the hardware

I It is then passed in the descriptor

I The CPU is chosen by masking out the first bits of the hash

(18)

Using RPS

I Kernel needs to be built with CONFIG_RPS

I The set of CPUs used depends on therxqthe frame arrives on

I echo 0x03 > /sys/class/net/eth0/queues/rx-0/rps_cpus

I echo 0x0c > /sys/class/net/eth0/queues/rx-1/rps_cpus

I →Traffic from rxq 0 is spread on CPUs 0 and 1

I →Traffic from rxq 1 is spread on CPUs 2 and 3

I Very useful for NICs with fewer rxqs than CPU cores !

(19)

RSS : Receive Side Scaling

Core 0 Core 1 Core 2 Core 3 MAC rxq0 rxq1 rxq2 I Offloaded version of RPS

I The NIC is configured to extract the header data and compute the Hash

I The CPU is chosen by means of an Indirection Table

I The NIC actually enqueues the packet into one of its queues

I The interrupt directly comes to the correct CPU !

(20)

RSS Tables

I Tables within the NIC

I Associates hashes to queues

I Tables commonly have more entries than queues I e.g. 128 entries for 4 queues

I Filling the table allows affecting weights to each queue

0x00: 0 0 0 0 0 0 0 0 0x08: 0 0 0 0 0 0 0 0 0x10: 1 1 1 1 1 1 1 1 0x18: 2 2 2 2 3 3 3 3 I rxq0 has weight 4 I rxq1 has weight 2

(21)

Using RSS

I RSS is configured through ethtool

I Enabling RSS : ethtool -K eth0 rx-hashing on

I Configuring the indirection table :

ethtool -X eth0 weight 1 2 2 1

I Dumping the indirection table : ethtool -x eth0

I Configuring the hashed fields :

ethtool -N eth0 rx-flow-hash tcp4 sdfn

I see man ethtool(8) for the meaning of each flow type

ethtool -X eth0 equal 4

ethtool -N eth0 rx-flow-hash tcp4 sdfn ethtool -N eth0 rx-flow-hash udp4 sdfn ethtool -K eth0 rx-hashing on

→Increased IP forwarding speed by a factor of 3 on the MacchiatoBin

(22)

RFS : Receive Flow Steering

I RPSand RSSdon’t care about who consumes which flow

I This might be bad for cache locality

I What if RPS/RSS sends a flow to CPU 1...

I ...but the consumer process lives on CPU 2 ?

I RFStracks the flows and their consumers

I Internally keeps a table associating flows to consumers/CPUs

I Updates indirection for a flow when the consumer migrates

1. httpdlives on CPU 0

2. RFSsteers TCP traffic to port 80 onto CPU 0

3. httpdis migrated to CPU 1

4. RFSupdates the flow table

(23)

Using RFS

I Internally, a flow tableassociates flow hashes to CPUs

I User indicates the size of the table

I echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

I echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

I echo 4096 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt

I ...

I We configure (32768/N(rxqs)) in each queue

I These values are recommended in the Kernel Documentation

(24)

aRFS : Accelerated Receive Flow Steering

I Advanced NICs can steer packets torxqs in Hardware

I aRFS asks the driver to configure steering rules for each flow

I Rules are updated upon migration of the consumer

I Packets always come to the right CPU !

I Kernel handles outstanding packets upon migration

I Needs support in HW, and a specific implementation in the driver

I The driver determines how to build the steering rule (n-tuple)

(25)

Using aRFS

I Kernel needs to be built with CONFIG_RFS_ACCEL

I Enable N-tuple filtering offloading :

ethtool -K eth0 ntuple on

I The NICand the driver needs to support aRFS

(26)

Flow Steering : Ethtool and TC flower

I Manually steering flows can be interesting for proper resource assignment

I This is also helpful to dedicate queues to flows, e.g. AF_XDP

I Two interfaces exists : tc flower andethtool

I Internally, bothethtoolandtcinterfaces are being merged...

I ... But for now the 2 methods coexist and can conflict I We insert steering rules in the NIC, with priorities

I Rules associate :

I Flow types : TCP4,UDP6,IP4,ether, etc.

I Filters : src-ip,proto, vlan,dst-port, etc.

I Actions : Targetrxq, drop,RSS context

I Location : Priority of the rule

(27)

Using tc flower and ethtool rxnfc

ethtool examples

I ethtool -K eth0 ntuple on

I ethtool -N eth0 flow-type udp4 dst-port 1234 action 2 loc 0 I Steer IPv4 UDP traffic for port 1234 to rxq 2

I ethtool -N eth0 flow-type udp4 action -1 loc 1 I Drop all UDP IPv4 traffic (except for port 1234)

TC flower example

I ethtool -K eth0 hw-tc-offload on

I tc qdisc add dev eth0 ingress

I tc flower protocol ip parent ffff: flower ip_proto tcp \ dst_port 80 action drop

I Drop all IPv4 TCP traffic for port 80

I tc flower falls back to software filtering if needed

(28)

RSS contexts

I Flows can also be steered to multiple queues at once

I RSS is then used to spread traffic accross queues

I This is achieved through RSS contexts

I An RSS context is simply an indirection table

I An RSS context is created with ethtool : I ethtool -X eth0 equal 4 context new

I The RSS context is uses as a destination for the flow :

ethtool -N eth0 flow-type udp4 dst-port 1234 context 1 loc 0

(29)

XPS : Transmit Packet Steering

I Upon transmitting packets, the driver executes completion

code

I Transmitting using a single CPU can also lead to cache misses

I XPSis used to select which txqto use for packet sending

I We can assigntxqs to CPUs

I Thetxqis chosen according to the CPU the sender lives on

I We can also assign txqs torxqs

I Make sure that we use the same CPU for RX and TX

I The NICdriver assignstxqs to CPUs

(30)

Using XPS

I Per-CPU mapping : I echo 0x01 > /sys/class/net/eth0/queues/tx-0/xps_cpus I echo 0x02 > /sys/class/net/eth0/queues/tx-1/xps_cpus I Assigntxq0 to CPU 0 I Assigntxq1 to CPU 1 I Per-rxq mapping : I echo 0x01 > /sys/class/net/eth0/queues/tx-0/xps_rxqs I echo 0x02 > /sys/class/net/eth0/queues/tx-1/xps_rxqs I Assigntxq0 torxq0 I Assigntxq1 torxq1

(31)

Fosdem, February 2021

Other offloading

Maxime Chevallier

[email protected]

(32)

Checksumming

I IPv4 and IPv6 include a checksum in the header

I NICs can compute checksums on the fly in TX mode

I The Kernel leaves the checksum fields empty

I tcpdump will show egress packets with a wrong checksum !!

(33)

Filtering

I Some NICs are capable of early dropping and filtering

I Frames are dropped by the NIC, no interrupt is ever fired

I MAC filtering :

I Drop frames with an unknown MAC address

I The NIC keeps information about multicast domains

I The NIC must also keep an updated list of unicast addresses

I MAC Vlansallows attaching multiple addresses to one NIC I VLAN filtering :

I Drop frames for unknown VLANs

I The NIC keeps track of VLANs attached to the interface

I ethtool -K eth0 rx-vlan-filter on

(34)

Data insertion and segmentation

I Some NICs can also insert the VLAN tag on the fly

I ethtool -K eth0 txvlan on

I The NIC will insert the VLAN Tag automatically

I ethtool -K eth0 rxvlan on

I The NIC will strip the VLAN tag

I The VLAN tag will be in the descriptor

I Some NICs can also deal with packet segmentation

I ethtool -K eth0 tso on

I Offload TCP segmentation, theNICwill generate segments

I ethtool -K eth0 ufo on

I Offload UDP frafgentation, theNICwill generate fragments

(35)

Fosdem, February 2021

XDP

Maxime Chevallier

[email protected]

(36)

Principle

I Execute a BPF program from within the NIC driver

I Executed as early as possible, for fast decision making

I Can be used to Pass,Drop or Redirectframes

I Also used for fine-grained statistics

BPF

I Berkley Packet Filter

I Programming language that can be formally verified

I Designed to write filtering rules

I Lots of hooks in the Networking Stack, XDP being the earliest

(37)

AF XDP

I Uses a combination of XDPand flow steering

I Response toDPDK : Userspace does the full packet processing

I Allows for heavily optimized and customized processing

I Special sockets that will directly receive raw buffers

I Thanks toXDP, we can select only part of the traffic for

AF_XDP

I The kernel stack is therefor not entirely bypassed...

I ...and this is a fully upstream solution !

(38)

Documentation and sources

I In the Kernel source code :

Documentation/networking/scaling.rst

I RedHat tuning guide:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_ tuning_guide/main-network

(39)

Network Performance in the Linux Kernel

Fosdem, February 2021

Network

Performance in

the Linux Kernel

Maxime Chevallier

Preamble - goals

Fosdem, February 2021

The path of a

packet

The Hardware

The NIC

Frame reception

Frame reception - IRQ

Frame reception - Unqueue

In the NIC driver

In the kernel networking stack

Fosdem, February 2021

Traffic Spreading

and Steering

Scaling across queues and CPUs

Scaling

N-tuple Flows

N-tuple Flows

RPS : Receive Packet Steering

Using RPS

RSS : Receive Side Scaling

RSS Tables

Using RSS

RFS : Receive Flow Steering

Using RFS

aRFS : Accelerated Receive Flow Steering

Using aRFS

Flow Steering : Ethtool and TC flower

Using tc flower and ethtool rxnfc

RSS contexts

XPS : Transmit Packet Steering

Using XPS

Fosdem, February 2021

Other offloading

Checksumming

Filtering

Data insertion and segmentation

Fosdem, February 2021

XDP

Principle

AF XDP

Documentation and sources

That’s it

Thank you !