Fosdem, February 2021
Network
Performance in
the Linux Kernel
Maxime Chevallier
© Copyright 2004-2021, Bootlin. Creative Commons BY-SA 3.0 license.
Corrections, suggestions, contributions and translations are welcome!
embedded Linux and kernel engineering
Maxime Chevallier
I Linux kernel engineer at Bootlin.
I Linux kernel and driver development, system integration, boot time optimization, consulting. . .
I Embedded Linux, Linux driver development, Yocto Project & OpenEmbedded and Buildroot training, with materials freely available under a Creative Commons license.
I https://bootlin.com
I Contributions:
I Worked on network (MAC, PHY, switch) engines.
I Contributed to the Marvell EBU SoCs upstream support.
I Worked on Rockchip’s Camera interface and Techwell’s
TW9900 decoder.
Preamble - goals
I Follow the path of packets through the Hardware and Software stack
I Understand the features of modern NICs
I Discover what the Linux Kernel implements to gain performances
I Go through the various offloadings available
Fosdem, February 2021
The path of a
packet
Maxime Chevallier [email protected] © Copyright 2004-2021, Bootlin. Creative Commons BY-SA 3.0 license.Corrections, suggestions, contributions and translations are welcome!
embedded Linux and kernel engineering
The Hardware
Link Partner CPU MAC RAM PHYI Link Partner : The other side of the cable
I Connector : 8P8C (RJ45), SFP, etc.
I Media : Copper, Fiber, Radio
I PHY : Converts media-dependent signals into standard data
The NIC
I Network InterfaceController
I Sometimes embed a PHY (PCIe networking card)
I The MAC : Handles L2 protocol, transfers data to the CPU
Frame reception
011011001 1101 Buffer
MAC RAM
Receive queue
CPU
I The MACreceived data and writes it to RAM usingDMA
I A descriptor is created
I It’s address is put in a queue
Frame reception - IRQ
Buffer MAC RAM CPU IRQ I An interrupt is firedFrame reception - Unqueue
011011001 1101 Buffer MAC RAM CPU Buffer 0111010011101001110101I The Interrupt handler acknowledges the interrupt
I The packet is processed insoftirq context
I The next frame can be received in parallel
In the NIC driver
I The CPU : Processes L3 (packets) and above, up to the application
I The Interrupt Handler does very basic work, and masks interrupts
I NAPI is used to schedule the processing in batches
I Subsequent frames are also dequeued
I NAPI stops dequeueing once :
I The budget is expired (release the CPU to the scheduler)
I The queue is empty
I NAPI re-enables interrupts
I This avoids having one interrupt per frame
In the kernel networking stack
I The PCAP hook is called, then the TC hook
I The header in unpacked, to decide if :
I The packet is forwarded
I The packet is dropped
I The packed is passed to a socket
I The in-kernel data path is heavily optimized...
I ...But still requires some processing power at very high speeds
Fosdem, February 2021
Traffic Spreading
and Steering
Maxime Chevallier [email protected] © Copyright 2004-2021, Bootlin. Creative Commons BY-SA 3.0 license.Corrections, suggestions, contributions and translations are welcome!
embedded Linux and kernel engineering
Scaling across queues and CPUs
I Most modern systems have multi-core CPUs
I Modern NICs have multiple RX/TX queues (rxq/txq)
I Interrupted CPU does all the packet processing
I If the interrupt always goes to the same core...
I ...the other ones will stay unused
Core 0 Core 1
Core 2 Core 3
MAC
Hardware and Software techniques exists to scale processing across CPUs
Scaling
Goal : Spread packet across CPU cores
I We can’t randomly assign packets to CPUs
I Ordering must be preserved
I Memory domains should be taken into account (L1/L2 caches, NUMA nodes)
I We need to spread packets per-flow
Kernel documentation : Documentation/networking/scaling.rst
N-tuple Flows
Flow: Packets from the same emitter, for the same consumer
I Flows are identified by data extracted from the headers
I L3 flow : Source and Destination IP addresses→ 2-tuple
I L4 flow : src/dst IP + Proto + src/dst ports → 5-tuple
MAC SA MAC DA ethtype vlan
ver ToS len id checksum
SRC port DST port len, chksum, etc.
IP DA Payload Proto IP SA L2 L3 L4 2-tuple MAC SA MAC DA ethtype vlan
ver ToS len id checksum
len, chksum, etc. IP DA DST port SRC port Payload Proto IP SA L2 L3 L4 5-tuple
N-tuple Flows
I Other flows can be interesting :
I Vlan-based flows
I Destination MAC-based flows
I Most often, these tuples are hashed prior to being used
I Allows to build smaller flow tables
I Advanced NICs are able to keep track of a high number of flows
MAC SA MAC DA ethtype vlan
ver ToS len id checksum
len, chksum, etc. IP DA DST port SRC port Proto IP SA L2 L3 L4 proto IP SA ... Hash(key) % N key = = cpu_id
RPS : Receive Packet Steering
Core 0 Core 1
Core 2 Core 3 MAC
I The interrupted CPU schedules processing on other CPUs
I Key data is extracted from the Headers and hashed
I The Hash can be computed by the hardware
I It is then passed in the descriptor
I The CPU is chosen by masking out the first bits of the hash
Using RPS
I Kernel needs to be built with CONFIG_RPS
I The set of CPUs used depends on therxqthe frame arrives on
I echo 0x03 > /sys/class/net/eth0/queues/rx-0/rps_cpus
I echo 0x0c > /sys/class/net/eth0/queues/rx-1/rps_cpus
I →Traffic from rxq 0 is spread on CPUs 0 and 1
I →Traffic from rxq 1 is spread on CPUs 2 and 3
I Very useful for NICs with fewer rxqs than CPU cores !
RSS : Receive Side Scaling
Core 0 Core 1 Core 2 Core 3 MAC rxq0 rxq1 rxq2 I Offloaded version of RPSI The NIC is configured to extract the header data and compute the Hash
I The CPU is chosen by means of an Indirection Table
I The NIC actually enqueues the packet into one of its queues
I The interrupt directly comes to the correct CPU !
RSS Tables
I Tables within the NIC
I Associates hashes to queues
I Tables commonly have more entries than queues I e.g. 128 entries for 4 queues
I Filling the table allows affecting weights to each queue
0x00: 0 0 0 0 0 0 0 0 0x08: 0 0 0 0 0 0 0 0 0x10: 1 1 1 1 1 1 1 1 0x18: 2 2 2 2 3 3 3 3 I rxq0 has weight 4 I rxq1 has weight 2
Using RSS
I RSS is configured through ethtool
I Enabling RSS : ethtool -K eth0 rx-hashing on
I Configuring the indirection table :
ethtool -X eth0 weight 1 2 2 1
I Dumping the indirection table : ethtool -x eth0
I Configuring the hashed fields :
ethtool -N eth0 rx-flow-hash tcp4 sdfn
I see man ethtool(8) for the meaning of each flow type
ethtool -X eth0 equal 4
ethtool -N eth0 rx-flow-hash tcp4 sdfn ethtool -N eth0 rx-flow-hash udp4 sdfn ethtool -K eth0 rx-hashing on
→Increased IP forwarding speed by a factor of 3 on the MacchiatoBin
RFS : Receive Flow Steering
I RPSand RSSdon’t care about who consumes which flow
I This might be bad for cache locality
I What if RPS/RSS sends a flow to CPU 1...
I ...but the consumer process lives on CPU 2 ?
I RFStracks the flows and their consumers
I Internally keeps a table associating flows to consumers/CPUs
I Updates indirection for a flow when the consumer migrates
1. httpdlives on CPU 0
2. RFSsteers TCP traffic to port 80 onto CPU 0
3. httpdis migrated to CPU 1
4. RFSupdates the flow table
Using RFS
I Internally, a flow tableassociates flow hashes to CPUs
I User indicates the size of the table
I echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
I echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
I echo 4096 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
I ...
I We configure (32768/N(rxqs)) in each queue
I These values are recommended in the Kernel Documentation
aRFS : Accelerated Receive Flow Steering
I Advanced NICs can steer packets torxqs in Hardware
I aRFS asks the driver to configure steering rules for each flow
I Rules are updated upon migration of the consumer
I Packets always come to the right CPU !
I Kernel handles outstanding packets upon migration
I Needs support in HW, and a specific implementation in the driver
I The driver determines how to build the steering rule (n-tuple)
Using aRFS
I Kernel needs to be built with CONFIG_RFS_ACCEL
I Enable N-tuple filtering offloading :
ethtool -K eth0 ntuple on
I The NICand the driver needs to support aRFS
Flow Steering : Ethtool and TC flower
I Manually steering flows can be interesting for proper resource assignment
I This is also helpful to dedicate queues to flows, e.g. AF_XDP
I Two interfaces exists : tc flower andethtool
I Internally, bothethtoolandtcinterfaces are being merged...
I ... But for now the 2 methods coexist and can conflict I We insert steering rules in the NIC, with priorities
I Rules associate :
I Flow types : TCP4,UDP6,IP4,ether, etc.
I Filters : src-ip,proto, vlan,dst-port, etc.
I Actions : Targetrxq, drop,RSS context
I Location : Priority of the rule
Using tc flower and ethtool rxnfc
ethtool examples
I ethtool -K eth0 ntuple on
I ethtool -N eth0 flow-type udp4 dst-port 1234 action 2 loc 0 I Steer IPv4 UDP traffic for port 1234 to rxq 2
I ethtool -N eth0 flow-type udp4 action -1 loc 1 I Drop all UDP IPv4 traffic (except for port 1234)
TC flower example
I ethtool -K eth0 hw-tc-offload on
I tc qdisc add dev eth0 ingress
I tc flower protocol ip parent ffff: flower ip_proto tcp \ dst_port 80 action drop
I Drop all IPv4 TCP traffic for port 80
I tc flower falls back to software filtering if needed
RSS contexts
I Flows can also be steered to multiple queues at once
I RSS is then used to spread traffic accross queues
I This is achieved through RSS contexts
I An RSS context is simply an indirection table
I An RSS context is created with ethtool : I ethtool -X eth0 equal 4 context new
I The RSS context is uses as a destination for the flow :
ethtool -N eth0 flow-type udp4 dst-port 1234 context 1 loc 0
XPS : Transmit Packet Steering
I Upon transmitting packets, the driver executes completion
code
I Transmitting using a single CPU can also lead to cache misses
I XPSis used to select which txqto use for packet sending
I We can assigntxqs to CPUs
I Thetxqis chosen according to the CPU the sender lives on
I We can also assign txqs torxqs
I Make sure that we use the same CPU for RX and TX
I The NICdriver assignstxqs to CPUs
Using XPS
I Per-CPU mapping : I echo 0x01 > /sys/class/net/eth0/queues/tx-0/xps_cpus I echo 0x02 > /sys/class/net/eth0/queues/tx-1/xps_cpus I Assigntxq0 to CPU 0 I Assigntxq1 to CPU 1 I Per-rxq mapping : I echo 0x01 > /sys/class/net/eth0/queues/tx-0/xps_rxqs I echo 0x02 > /sys/class/net/eth0/queues/tx-1/xps_rxqs I Assigntxq0 torxq0 I Assigntxq1 torxq1Fosdem, February 2021
Other offloading
Maxime Chevallier
© Copyright 2004-2021, Bootlin. Creative Commons BY-SA 3.0 license.
Corrections, suggestions, contributions and translations are welcome!
embedded Linux and kernel engineering
Checksumming
I IPv4 and IPv6 include a checksum in the header
I NICs can compute checksums on the fly in TX mode
I The Kernel leaves the checksum fields empty
I tcpdump will show egress packets with a wrong checksum !!
Filtering
I Some NICs are capable of early dropping and filtering
I Frames are dropped by the NIC, no interrupt is ever fired
I MAC filtering :
I Drop frames with an unknown MAC address
I The NIC keeps information about multicast domains
I The NIC must also keep an updated list of unicast addresses
I MAC Vlansallows attaching multiple addresses to one NIC I VLAN filtering :
I Drop frames for unknown VLANs
I The NIC keeps track of VLANs attached to the interface
I ethtool -K eth0 rx-vlan-filter on
Data insertion and segmentation
I Some NICs can also insert the VLAN tag on the fly
I ethtool -K eth0 txvlan on
I The NIC will insert the VLAN Tag automatically
I ethtool -K eth0 rxvlan on
I The NIC will strip the VLAN tag
I The VLAN tag will be in the descriptor
I Some NICs can also deal with packet segmentation
I ethtool -K eth0 tso on
I Offload TCP segmentation, theNICwill generate segments
I ethtool -K eth0 ufo on
I Offload UDP frafgentation, theNICwill generate fragments
Fosdem, February 2021
XDP
Maxime Chevallier
© Copyright 2004-2021, Bootlin. Creative Commons BY-SA 3.0 license.
Corrections, suggestions, contributions and translations are welcome!
embedded Linux and kernel engineering
Principle
I Execute a BPF program from within the NIC driver
I Executed as early as possible, for fast decision making
I Can be used to Pass,Drop or Redirectframes
I Also used for fine-grained statistics
BPF
I Berkley Packet Filter
I Programming language that can be formally verified
I Designed to write filtering rules
I Lots of hooks in the Networking Stack, XDP being the earliest
AF XDP
I Uses a combination of XDPand flow steering
I Response toDPDK : Userspace does the full packet processing
I Allows for heavily optimized and customized processing
I Special sockets that will directly receive raw buffers
I Thanks toXDP, we can select only part of the traffic for
AF_XDP
I The kernel stack is therefor not entirely bypassed...
I ...and this is a fully upstream solution !
Documentation and sources
I In the Kernel source code :
Documentation/networking/scaling.rst
I RedHat tuning guide:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_ tuning_guide/main-network