Lagopus: High-performance Software OpenFlow Switch for Wide-area Network

(1)

Lagopus:

High-performance Software OpenFlow

Switch for Wide-area Network

A bird in cloud

Yoshihiro Nakajima

NTT Labs

(2)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(3)

Our mission

Innovate networking

with

(4)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(5)

Trend shift in networking



Closed (Vender lock-in)



Yearly dev cycle



Waterfall dev



Standardization



Protocol



Special purpose HW / appliance



Distributed cntrl



Custom ASIC / FPGA



Open (lock-in free)



Monthly dev cycle



Agile dev



DE fact standard



API



Commodity HW/ Server



Logically centralized cntrl



Merchant Chip

(6)

What is Software-Defined Networking?

Innovate services and applications

in software development speed.

Reference: http://opennetsummit.org/talks/ONS2012/pitt-mon-ons.pdf

Decouple control plane and data plane

→Free control plane out of the box

（OpenFlow is one of APIs）

Logically centralized view

→Hide and abstract complexity of networks,

provide entire view of the network

Programmability via abstraction layer

→Enables flexible and rapid service/application

development

(7)

OpenFlow vs conventional NW node

OpenFlow controller

OpenFlow switch

Data-plane

Flow Table

Control plane

(Routing /

swithching)

Data-plane

(ASIC, FPGA)

Control plane

(routing/ switching)

Flow match

action

Flow match

action

counter

OpenFlow Protocol

Flexible flow match pattern

(Port #, VLAN ID, MAC addr,

IP addr, TCP port #)

Action (frame processing)

(output port, drop,

modify/pop/push header)

Flow statistics

(# of packet, byte size,

flow duration,… )

counter

Conventional NW node

OpenFlow

Flow

Table

#2

Flow

Table

#3

(8)

Why SDN?

Differentiate services (Innovation)

• Increase user experience

• Provide unique service which is not in the

market

Time-to-Market（Velocity）

•Not to depend on vendors’ feature roadmap

•Develop necessary feature when you need

Cost-efficiency

•Reduce OPEX by automated workflows

•Reduce CAPEX by COTS hardware

(9)

Evaluate the benefits of SDN

by implementing

(10)

(11)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(12)

Lessons learned from existing vswitch

 Not enough performance

 Packet processing speed < 1Gbps 

 10K flow entry add > two hours 

 Flow management processing is too heavy 

 Develop & extension difficulties

 Many abstraction layer cause confuse for me 

• Interface abstraction, switch abstraction, protocol abstraction

……… Packet abstraction, Too complex 

 Many packet processing codes exists for the same

processing 

• Userspace, Kernelspace

 Invisible configuration exists

 Invisible flow rules cause chaos debugging in flow

design

(13)

Motivation of vSwitch

 Agile and flexible networking

 Full automation in provisioning, operation and

management

 Seamless networking for customers

 Server virtualization and NFV needs

a high-performance software switch

 Short latency

 Wire-rate in case of short packet (64B)

 No High-performance OpenFlow 1.3

software switch for wide-area networks

 1M flow rules

 10Gbps-wire-rate

(14)

vSwitch requirement from user side

1. Run on the commodity PC server and NIC

2. Provide a gateway function to allow connect

different various network domains

 Support of packet frame type in DC, IP-VPN, MPLS and

access NW

3. Achieve 10Gbps-wire rate with >= 1M flow rules

 low-latency packet processing

 flexible flow lookup using multiple-tables

 High performance flow rule setup/delete

4. Run in userland and decrease tight-dependency

to OS kernel

 easy software upgrade and deployment

5. Support various management and configuration

protocols.

(15)

Why Intel x86 server for networking?

 Strong processing power

 Many CPU cores

• 8 cores/CPU (2012), 12 cores/CPU (2013),…

 Rapid development cycle

 Available everywhere

 Can be purchased in Akihabara!

 3 month lead time in case of special purpose HW

 Reasonable price and flexible configuration

 8 core CPU (ATOM) with 1,000 USD

 Can chose configuration according to your budget

 Many programmers and engineers

(16)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(17)

Take a bible

Typical packet processing

RX

packet

Frame parsing

Flow lookup

(match/action)

QoS・Queue

TX

packet

Processing in NIC driver

Parse packet for post prosessing

Lookup,

Header rewrite

Policer, Shaper

Marking

(18)

Typical switch architecture

Switch agent

(dynamic routing handling,

dataplane ctrl)

Dataplane

(packet processing

forwarding)

Other control / management system

BGP, OSPF, OpenFlow

Events (link up, down)

statistics

Dataplane cntrl

message

Flow table

statistics

CLI

BGP

(19)

OpenFlow

OpenFlow controller

OpenFlow switch

Data-plane

Flow Table

Control plane

(routing/ switching)

Flow match

action

Flow match

action

counter

OpenFlow Protocol

Flexible flow match pattern

(Port #, VLAN ID, MAC addr,

IP addr, TCP port #)

Action (frame processing)

(output port, drop,

modify/pop/push header)

Flow statistics

(# of packet, byte size,

flow duration,… )

counter

OpenFlow

Flow

Table

#2

Flow

Table

#3

(20)

# of packet to be proceeded for 10Gbps

0 2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

0

256

512

768 1024

1280

# o

f packe

ts

pe

r se

cond

s

Packet size (Byte)

Short packet 64Byte

14.88 MPPS, 67.2 ns

Computer packet 1KByte

(21)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(22)

Intel x86 server configuration

NIC

CPU

Memory

NIC

QPI

PCI-Exp

(23)

Recent CPU details

 CPU socket has multiple CPU cores

 6 – 18 CPU cores in Intel Haswell

 CPU has hieratical $ system

 Each CPU core has separate L1 and L2 $

 Multiple CPU cores share L3 $

 CPU core are connected with multiple ring network

 Multiple memory controller

 4 memory controller on Intel Xeon E5-2600

 2 memory controller on Desktop CPU

(24)

CPU $ and memory

 CPU $ accelerate speed using data locality

 Data processing in CPU $ can be performed so fast!

 Data access to main memory takes long time

 Data prefetching mechanism can be

leveraged when sequential data access

Size

(Byte)

Bandwidth

(GB/s)

Latency -

random

L1 $

32K

699 4 clock

L2 $

256K

228 12 clock

L3 $

6M

112 44 clock

(25)

processing (2Ghz CPU)

0 1000

2000

3000

4000

5000

6000

L2 forwarding

IP routing

L2-L4 classification

TCP termination

Simple OF processing

DPI

# of required CPU cycles

(26)

with 1 CPU core

0 2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

0

256

512

768 1024

1280

# o

f packe

ts

pe

r se

cond

s

Packet size (Byte)

Short packet 64Byte

14.88 MPPS, 67.2 ns

• 2Ghz: 134 clocks

• 3Ghz: 201 clocks

Computer packet 1KByte

1.2MPPS, 835 ns

• 2Ghz: 1670 clocks

• 3Ghz: 2505 clocks

(27)

Standard packet processing on

Linux/PC

NIC

skb_buf

Ethernet Driver API

Socket API

vswitch

packet

buffer

Data plane

Userspace

packet processing

(Event-based)

1. Interrupt

& DMA

2. system call

(read)

User

space

Kernel

space

Driver

4. DMA

3. system call (write)

Kernel space

packet processing

(Event-based)

NIC

skb_buf

Ethernet Driver API

Socket API

vswitch

packet

buffer

1. Interrupt

& DMA

vswitch

Data plane

agent

2. DMA

(28)

Standard packet processing on

Linux/PC

NIC

skb_buf

Ethernet Driver API

Socket API

vswitch

packet

dataplane

1. Interrupt

& DMA

2. system call

(read)

User

space

Kernel

space

Driver

4. DMA

3. system call (write)

NIC

skb_buf

Ethernet Driver API

Socket API

vswitch

packet

1. Interrupt

& DMA

vswitch

dataplane

agent

2. DMA

Contexts switch

Interrupt-based

Many memory

copy / read

Userspace

packet processing

(Event-based)

Kernel space

packet processing

(Event-based)

(29)

Performance issues

1. Massive RX interrupts handling for NIC device

2. Heavy overhead of task switch

3. Lower performance of PCI-Express I/O and

memory bandwidth compared with CPU

4. Shared data access is bottleneck between

threads

5. Huge amount of $ miss/update in data

translation lookaside buffer (TLB) (4KB)

(30)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(31)

Technologies for speed

 Leverage multi-core CPU

 Pipeline processing

 Parallel processing

 Multi thread programming

 Leverage high-performance I/O libs

 Intel DPDK

 Leverage lock-free memory libs

 lock-less queue

 RCU (read copy update)

 Design high performance flow lookup

 Refer bible

(32)

Basic techniques for performance (1/2)

 Pipeline

 Divide whole processing to processing elements

 Give additional time for processing (x # of elements)



Parallelization

 Parallel execution when multiple processing units

 Increase PPS throughput (x # of processing units)

lookup

RX

action

QoS

TX

RX

lookup

action

QoS

TX

lookup

RX

action

QoS

TX

lookup

RX

action

QoS

TX

lookup

RX

action

QoS

TX

lookup

RX

action

QoS

TX

lookup

RX

action

QoS

TX

RX

lookup

action

QoS

TX

lookup

(33)

Basic techniques for performance (2/2)

 Packet batching

 Reduce # of lock in flow lookup table

• Frequent locks are required

– Switch OpenFlow agent and counter retrieve for SNMP

– Packet processing

Input packet

Lookup table

Input packet

Lookup table

Lock

Unlock

●Naïve implementation

●Packet batching implementation

Lock is required

for each packet

Lock can be reduced

due to packet batching

(34)

Implementation techniques for speed

 Bypass in packet processing

 Direct data copy from NIC buffer to CPU $

(Intel Data Direct I/O)

 Kernel processing bypass of network stack

 Fast flow lookup using flow $ mechanism

 Reduce memory access and leverage locality

 High-speed flow lookup algorithm

 Utilize thread local storage

 Explicit assignment of CPU and processing

thread

 Hardware-aware assignment

(Topology of CPU, bus, and NIC)

(35)

Intel Data Plane Development Kit

 x86 architecture-optimized

data-plane library and NIC

driver

 Memory structure-aware queue,

buffer management

 packet flow classification

 polling mode-based NIC driver

 Low-overhead & high-speed

runtime optimized with

data-plane processing

 Abstraction layer for hetero

server environments

(36)

(37)

DPDK Apps

NIC

skb_buf

Ethernet Driver API

Socket API

vswitch

packet

buffer

Dataplane

1. Interrupt

& DMA

2. system call

(read)

User

space

Kernel

space

Driver

4. DMA

3. system call (write)

NIC

Ethernet Driver API

Socket API

vswitch

packet

buffer

agent

1. DMA

Write

2. DMA

READ

DPDK

dataplane

Userspace

packet processing

(Event-based)

DPDK apps

(polling-based)

(38)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(39)

Target of Lagopus vSwitch

 High performance soft-based OpenFlow switch

 10Gbps wire-rate packet processing / port

 1M flow entries

 Expands the SDN apps to wide-area networks

 Not only for data centers

 WAN protocols, e.g. MPLS and PBB

(40)

Approach for vSwitch development

 Simple is better for everything

 Packet processing

 Protocol handling

 Straight forward approach

 No OpenFlow 1.0 support

 Full scratch (No use of existing vSwitch code)

 User land packet processing as much as

possible, keep kernel code small

 Kernel module update is hard for operation

 Every component & algorithm can be

replaced

(41)

vSwitch design

 Simple modular-based

design

 Switch control agent

 Data-plane

 Switch control agent

 Unified configuration data

store

 Multiple management IF

 Data-plane control with HAL

 Data-plane

 High performance I/O

library (Intel DPDK)

(42)

Implementation strategy for vSwitch

1. Massive RX interrupts handling for NIC device

=> Polling-based packet receiving

2. Heavy overhead of task switch

=> Explicit thread assignment, no kernel

scheduler (one thread/one physical CPU)

3. Lower performance of PCI-Express I/O and memory

bandwidth compared with CPU

=> Reduction of # of access in I/O and memory

4. Shared data access is bottleneck between threads

=> Lockless-queue, RCU, batch processing

5. Huge amount of $ miss/update in data translation

lookaside buffer (TLB) (4KB)

(43)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Challenges with Intel x86 servers

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(44)

Packet processing using multi core CPUs



Exploit many core CPUs

 Reduce data copy & move (reference access)

 Simple packet classifier for parallel processing in I/O RX



Decouple I/O processing and flow processing

 Improve D-cache efficiency



Explicit thread assign to CPU core

NIC 1

RX

NIC 2

RX

I/O RX

CPU0

I/O RX

CPU1

NIC 1

TX

NIC 2

TX

I/O TX

CPU6

I/O TX

CPU7

Flow lookup

packet processing

CPU2

Flow lookup

packet processing

CPU4

Flow lookup

packet processing

CPU3

Flow lookup

NIC 3

RX

NIC 4

RX

NIC 3

TX

NIC 4

TX

NIC RX buffer

Ring buffer

(45)

Packet batching

 Reduce # of lock in flow lookup table

 Frequent locks are required

• Switch OpenFlow agent and counter retrieve for SNMP

• Packet processing

Input packet

Lookup table

Input packet

Lookup table

Lock

Unlock

●Naïve implementation

●Packet batching implementation

Lock is required

for each packet

Lock can be reduced

due to packet batching

(46)

Bypass pipeline with flow $

●Naïve implementation

_table1

_table2

table3

Input packet

table1

table2

table3

Input packet

Flow cache

1. New flow

2. Success flow

Output

packet

Multiple flow $

generation & mngmt

Write flow $

●Lagopus implementation

Output

packet



Reduce # of flow lookup in multiple table

 Exploit flow $ that allows bypass of

multi-flow-lookup

(47)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Packet processing overview

 Challenges with Intel x86 servers

 Technologies for speed

 Design

 Implementation

 Evaluation

 Future Development Plan

 Open Source Activity

(48)

OpenFlow 1.3 Conformance Status

 Conformance test results by Ryu Certification

(49)

Performance Evaluation

 Summary

 Throughput:

10Gbps wire-rate

 Flow entries:

1M flow entry support

• >= 4000 flow rule manipulation / sec

 Evaluation models

 WAN-DC gateway

• MPLS-VLAN mapping

 L2 switch

(50)

Evaluation scenario

Usecase：Cloud-VPN gateway

Tomorrow：

• Automatic connection setup via North Bound APIs

• SDN controller maintain mapping between tenant logical

network and VPN

• Routes are advertised via eBGP, no need to configure ASBRs

on provider side

Data center

Network

MPLS-VPN

MPLS

SDN

controller

eBGP

OF

API

From ONS2014 NTT COM

Ito-san presentation

(51)

Performance Evaluation



Evaluation setup



Server spec.

 CPU: Dual Intel Xeon E5-2660

• 8 core(16 thread), 20M Cache, 2.2 GHz, 8.00GT/s QPI, Sandy

bridge

 Memory: DDR3-1600 ECC 64GB

• Quad-channel 8x8GB

 Chipset: Intel C602

 NIC: Intel Ethernet Converged Network Adapter X520-DA2

• Intel 82599ES, PCIe v2.0

Server

Lagopus

Flow

table

tester

Flows

Throughput (bps/pps/%)

Flow

rules

Packet size

Flow

cache

(on/off)

(52)

WAN-DC Gateway

Throughput vs packet size, 1 flow, no flow-cache

0

1

2

3

4

5

6

7

8

9

10

0

200

400

600

800 1000

1200

1400

1600

Th

ro

u

gh

p

u

t

(G

b

p

s)

Packet size (byte)

10 flow rules

100 flow rules

1k flow rules

10k flow rules

100k flow rules

1M flow rules

(53)

WAN-DC Gateway

Throughput vs packet size, 1 flow, flow-cache

0

1

2

3

4

5

6

7

8

9

10

0

200

400

600

800 1000

1200

1400

1600

Th

ro

u

gh

p

u

t

(G

b

p

s)

Packet size (byte)

10 flow rules

100 flow rules

1k flow rules

10k flow rules

100k flow rules

1M flow rules

(54)

WAN-DC Gateway

Throughput vs flows, 1518 bytes packet

0

1

2

3

4

5

6

7

8

9

10

1

10

100 1000

10000

100000

1000000

Th

ro

u

gh

p

u

t

(G

b

p

s)

flows

10k flow rules

100k flow rules

1M flow rules

(55)

L2 switch performance (Mbps)

10GbE x 2 (RFC2889 test)

0 1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

LINC

OVS

(netdev)

(kernel)

OVS

(software)

Lagopus

Mbps

72

128

256

512 1024

1280

1518

Intel xeon E5-2660 (8cores, 16threads),

Intel X520-DA2, DDR3-1600 64GB

(56)

Agenda

 Trends

 Motivation and Target

 vSwitch

 Challenges with Intel x86 servers

 Design

 Implementation

 Evaluation

 Future Development Plan

(57)

Y2014 development plan

 Version 0.1, 2014 Q2

 Pure OpenFlow switch

 Intel DPDK 1.6.0

 CLI

 Version 0.3, 2014 Q4

 Hypervisor integration

 Virtual NIC with KVM

 KVM and OpenStack for NFV

 Version 0.7, 2015 Q1

 40Gbps-wirerate high-performance switch

 Legacy protocol (bridge, L3)

 New configuration system (OVSDB, OF-CONFIG, new CLI)

 Version 1.0, 2015 Q2

 Tunnel handling (VxLAN, GRE) for overlay NW in L3 NW

 SNMP

 Hardware implementation

(58)

One more thing 

Location-aware

bandwidth-controlled WiFI with Ryu and Lagopus

@ SDN Japan 2014

(59)

Concept

 Good audience can connect to

the Internet with guaranteed BW

 Front area sheet, good connection.

 Back area sheet, poor connection

(60)

Location-aware

(61)

Location-aware

bandwidth-controlled WiFi

Front

(62)

Location-aware

bandwidth-controlled WiFi

Front

Screen

50Mbps 30Mbps 20Mbps

Back

Good audience can connect to

the Internet with guaranteed BW

(63)

Result

 Many audience in back area sheet use

mobile WIFI (tethering by smartphone) to

connect to the Internet 

 Next step:

 We need 3G or LTE mobile radio jamming hardware

to shut down ce

(64)

Network topology

cat3750x

wlc

rtx1200

trunk:

WIFI controller

NAT/DHCP

server

untagged:

(65)

Lagopus: Flow table design

meter

wlc

rtx1200

cat3750x

①

②

③

id: 0 (MAC learning)

match: src mac

action: pop vlan,

goto id:1

default: packet-in

id: 1 (Flooding or forwarding)

default: flood

match: dst mac

action: push vlan,

meter, output

50Mbps

30Mbps

20Mbps

Group

: flooding actions

③ pkt -> flood

② pkt -> flood

① put -> flood

(66)

Lagopus: Group table design

 Group Type: ALL, SELECT, INDIRECT, FF

 ALL for flooding

group

③ pkt -> flood

② pkt -> flood

① put -> flood

[②], [push101, ①], [push 102, ②], [push 103, ③]

[③], [push101, ①], [push 102, ②], [push 103, ③]

[②], [③]

wlc

rtx1200

①

②

③

All each action set can be executed

buckets

(67)

Packet processing flow (1/2)

meter

wlc

rtx1200

cat3750x

①

②

③

id: 0

match: src mac

action: pop vlan,

goto id:1

default: packet-in

group

③ pkt -> flood

② pkt -> flood

① put -> flood

id: 1

default: flood

match: dst mac

action: push vlan,

meter, output

50Mbps

30Mbps

20Mbps

101 aa..aa

ff..ff

from ①

Packet-out

Flow entry

registration

(68)

if dst mac

then push vlan101,

meter 50Mbps,

output①

Packet processing flow (2/2)

meter

wlc

rtx1200

cat3750x

①

②

③

id: 0

match: src mac

action: pop vlan,

goto id:1

default: packet-in

group

③ pkt -> flood

② pkt -> flood

① put -> flood

id: 1

default: flood

match: dst mac

action: push vlan,

meter, output

50Mbps

30Mbps

20Mbps

aa..aa

if src mac

then pop vlan,

goto id:1

flow entry

registaration

(69)