Packet-based Network Traffic
Monitoring and Analysis with GPUs
Wenji Wu, Phil DeMar
wenji@fnal.gov, demar@fnal.gov
GPU Technology Conference 2014
March 24-27, 2014 SAN JOSE, CALIFORNIA
Background
• Main uses for network traffic monitoring & analysis tools:
– Operations & management
– Capacity planning
– Performance troubleshooting
• Levels of network traffic monitoring & analysis:
– Device counter level (SNMP data)
– Traffic flow level (flow data)
– Packet level (The Focus of this work)
• security analysis
• application performance analysis
• traffic characterization studies
Characteristics of packet-based network monitoring & analysis applications
• Time constraints on packet processing.
• Compute and I/O throughput-intensive
• High levels of data parallelism.
– Packet parallelism: each packet can be processed independently (sketched below).
– Flow parallelism: each flow can be processed independently.
• Extremely poor temporal locality for data
– Typically, data is processed once in sequence and rarely reused.
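A minimal CUDA sketch of the packet-parallel pattern (the record layout and kernel are illustrative, not from this work): one thread maps to one packet, and no inter-thread communication is needed.

#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical fixed-size packet record; real layouts vary.
struct PacketRecord {
    uint32_t len;        // captured length in bytes
    uint8_t  data[256];  // truncated packet contents
};

// Packet parallelism: one thread per packet, each processed independently.
__global__ void per_packet_kernel(const PacketRecord *pkts, uint32_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pkts[i].len;  // stand-in for real per-packet work
}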
Background (cont.)
Packet-based traffic monitoring & analysis tools face performance & scalability challenges within high-performance networks.
– High-performance networks:
• 40GE/100GE link technologies
• Servers are 10GE-connected by default
• 400GE backbone links & 40GE host connections loom on the horizon.
– Millions of packets generated & transmitted per sec
The Problem
• Requirements on the computing platform for high-performance network monitoring & analysis applications:
– High Compute power
– Ample memory bandwidth
– Capability of handling the data parallelism inherent in network data
– Easy programmability
Monitoring & Analysis Tool Platforms (I)
• Three types of computing platforms:
– NPU/ASIC
– CPU
– GPU
Monitoring & Analysis Tool Platforms (II)
Features                        NPU/ASIC   CPU   GPU
High compute power              Varies     ✖     ✔
High memory bandwidth           Varies     ✖     ✔
Easy programmability            ✖          ✔     ✔
Data-parallel execution model   ✖          ✖     ✔
Architecture Comparison
Our Solution
Use GPU-based traffic monitoring & analysis tools.

Highlights of our work:
• Demonstrated that GPUs can significantly accelerate network traffic monitoring & analysis
– 11+ million pkts/s without drops (single Nvidia M2070)
• A generic I/O architecture to move network traffic from the wire into the GPU domain
• Implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis
– Dozens of CUDA kernels, which can be combined in a variety of ways to perform monitoring and analysis tasks
Key Technical Issues
• GPU’s relatively small memory size:
– Nvidia M2070 has 6 GB of memory; K10 has 8 GB
– Workarounds:
• Mapping host memory into GPU with zero-copy technique? ✖
• Partial packet capture approach ✔
• Need to capture & move packets from wire into GPU domain without packet loss
• Need to design data structures that are efficient for both CPU and GPU (see the record sketch below)
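A sketch of what a partial-capture record might look like (the field names and the 96-byte snap length are assumptions, not from the talk): fixed-size, header-only records keep millions of packets within the GPU's 6-8 GB of memory and are equally convenient for CPU and GPU code.

#include <stdint.h>

// Illustrative partial-capture record: keep only the first SNAP_LEN
// bytes of each packet (enough for Ethernet/IP/TCP headers), so that
// millions of packets fit in the GPU's 6-8 GB of memory.
#define SNAP_LEN 96

struct CapturedPacket {
    uint64_t ts_ns;          // capture timestamp (nanoseconds)
    uint16_t wire_len;       // original length on the wire
    uint16_t cap_len;        // bytes actually captured (<= SNAP_LEN)
    uint8_t  hdr[SNAP_LEN];  // truncated packet contents
};
// Fixed-size records give coalesced GPU loads and let the CPU locate
// any packet by index, without per-packet pointers.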
System Architecture
[Figure: system architecture. Network packets are captured from the NICs into packet buffer chunks, preprocessed, and handed to monitoring & analysis kernels in the GPU domain; results are returned to user space for output display.]

Four types of logical entities (a pipeline sketch follows below):
• Traffic Capture
• Preprocessing
• Monitoring & Analysis
• Output Display
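A minimal host-side sketch of how the four entities could chain together (all function names and bodies here are placeholders, not the library's actual API):

#include <cstddef>

struct PacketChunk { void *data; std::size_t used; };

// Placeholder entity implementations; a real system would back these
// with the packet I/O engine, cudaMemcpy, and kernel launches.
static PacketChunk *capture_chunk()                   { static PacketChunk c{}; return &c; }
static void preprocess_and_copy_to_gpu(PacketChunk *) { /* build records, H2D copy */ }
static void launch_monitoring_kernels()               { /* CUDA kernels */ }
static void fetch_and_display_results()               { /* D2H copy, report */ }
static void recycle_chunk(PacketChunk *)              { /* return chunk to pool */ }

// Host-side loop mirroring the four logical entities.
int main()
{
    for (int i = 0; i < 10; i++) {              // bounded here; a real tool loops forever
        PacketChunk *chunk = capture_chunk();   // 1. Traffic Capture
        preprocess_and_copy_to_gpu(chunk);      // 2. Preprocessing
        launch_monitoring_kernels();            // 3. Monitoring & Analysis
        fetch_and_display_results();            // 4. Output Display
        recycle_chunk(chunk);                   // recycle the buffer
    }
    return 0;
}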
Packet I/O Engine
[Figure: packet I/O engine. Incoming packets from the NIC are placed, via the receive descriptor ring, directly into pre-allocated packet buffer chunks; filled chunks are captured into user space for processing and then recycled back to the free-chunk pool in the OS kernel.]
Key techniques
• A novel ring-buffer-pool mechanism
• Pre-allocated large packet buffers
• Packet-level batch processing
• Memory-mapping-based zero-copy
Key Operations
• Open
• Capture
• Recycle
• Close
Goal:
• To avoid packet capture loss (an interface sketch follows)
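A sketch of what the ring-buffer-pool interface might look like, assuming names and layout the slides don't give. The point is that buffers are pre-allocated, memory-mapped into user space (zero-copy), handed out in whole chunks (batching), and recycled rather than freed:

#include <cstddef>

// Illustrative ring-buffer-pool interface (names and layout are
// assumptions; the real engine spans a kernel driver and user space).
struct BufferChunk {
    void        *base;   // start of a pre-allocated packet buffer
    std::size_t  size;   // chunk capacity in bytes
    std::size_t  used;   // bytes filled with captured packets
};

struct IoEngine { /* in the real engine: rings shared with the driver */ };

// Open: map the pre-allocated pool and attach chunks to the NIC RX ring.
static IoEngine *io_open(const char *)            { static IoEngine e; return &e; }
// Capture: hand a filled chunk to user space via memory mapping (zero-copy).
static BufferChunk *io_capture(IoEngine *)        { static BufferChunk c{}; return &c; }
// Recycle: return an emptied chunk to the free pool for reuse.
static void io_recycle(IoEngine *, BufferChunk *) {}
// Close: detach from the ring and unmap the pool.
static void io_close(IoEngine *)                  {}

int main()
{
    IoEngine *e = io_open("eth2");           // hypothetical NIC name
    for (int i = 0; i < 1000; i++) {
        BufferChunk *c = io_capture(e);      // a whole batch of packets, no per-packet copy
        /* process c->base[0 .. c->used) */
        io_recycle(e, c);                    // pre-allocation + recycling avoids drops
    }
    io_close(e);
    return 0;
}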
GPU-based Network Traffic Monitoring & Analysis Algorithms
• A GPU-accelerated library for network traffic capturing, monitoring, and analysis apps.
– Dozens of CUDA kernels
– Can be combined in a variety of ways to perform the intended monitoring & analysis operations
Packet-Filtering Kernel
[Figure: the packet-filtering kernel on eight packets p1..p8 in raw_pkts[]]
1. Filtering: each thread evaluates the filter on one packet and writes a 0/1 flag into filtering_buf[] (here: 1 0 1 1 0 0 1 0).
2. Scan: an exclusive prefix-sum over filtering_buf[] yields output offsets in scan_buf[] (0 1 1 2 3 3 3 4).
3. Compact: packets whose flag is 1 are written to filtered_pkts[] at their scan offsets (p1 p3 p4 p7).
Advanced packet filtering at wire speed is necessary so that only the packets of interest are analyzed. We use the Berkeley Packet Filter (BPF) as the packet filter. The kernel is built from a few basic GPU operations, such as sort, prefix-sum, and compact (sketched below).
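A compact CUDA sketch of the filter/scan/compact sequence shown above. The length-based predicate is a stand-in for real BPF evaluation, and Thrust's exclusive scan replaces a hand-written prefix-sum; everything else follows the figure.

#include <cuda_runtime.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <stdint.h>

struct Pkt { uint32_t len; uint8_t hdr[96]; };  // illustrative record

// Step 1: Filtering - one thread per packet writes a 0/1 flag.
// A real implementation would evaluate compiled BPF bytecode here;
// this placeholder predicate just checks a minimum length.
__global__ void filter_kernel(const Pkt *raw, int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (raw[i].len >= 64) ? 1 : 0;
}

// Step 3: Compact - packets with flag 1 move to their scan offset.
__global__ void compact_kernel(const Pkt *raw, const int *flags,
                               const int *offsets, Pkt *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[offsets[i]] = raw[i];
}

void filter_packets(const Pkt *d_raw, Pkt *d_out, int *d_flags,
                    int *d_offsets, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    filter_kernel<<<blocks, threads>>>(d_raw, d_flags, n);
    // Step 2: Scan - exclusive prefix-sum turns flags into output offsets
    // (the 1 0 1 1 0 0 1 0 -> 0 1 1 2 3 3 3 4 step in the figure).
    thrust::exclusive_scan(thrust::device, d_flags, d_flags + n, d_offsets);
    compact_kernel<<<blocks, threads>>>(d_raw, d_flags, d_offsets, d_out, n);
}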
Sample Use Case 1
Using our GPU-accelerated library, we developed a sample use case to monitor network status:
• Monitor networks at different levels:
– from aggregate of entire network down to one node
• Monitor network traffic by protocol
• Monitor network traffic information per node:
– Determine who is sending/receiving the most traffic
– For both local and remote addresses
• Monitor IP conversations:
– Characterizing by volume or other traits
Parallelism exploited: packet parallelism
Sample Use Case 1 – Data Structures
Three key data structures are created on the GPU:
• protocol_stat[]
– an array that is used to store protocol statistics for network traffic, with each entry associated with a specific protocol.
• ip_snd[] and ip_rcv[]
– arrays that are used to store traffic statistics for IP conversations in the send and receive directions, respectively.
• ip_table
– a hash table that is used to keep track of network traffic information of each IP address node.
These data structures are designed to reference themselves and each other with relative offsets, such as array indexes (see the sketch below).
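One possible layout for these structures (field names and sizes are assumptions): all cross-references are array indexes rather than pointers, so the structures remain valid when copied between CPU and GPU address spaces.

#include <stdint.h>

// Possible layouts; field names and sizes are assumptions.
struct ProtoStat { uint64_t pkts, bytes; };      // protocol_stat[] entry, one per protocol

struct ConvStat {                                // ip_snd[] / ip_rcv[] entry
    uint32_t src_ip, dst_ip;
    uint64_t pkts, bytes;
    int32_t  next;                               // index of the node's next conversation, -1 = none
};

struct IpNode {                                  // ip_table (hash table) entry
    uint32_t ip;
    uint64_t pkts_in, bytes_in, pkts_out, bytes_out;
    int32_t  first_snd, first_rcv;               // indexes into ip_snd[] / ip_rcv[]
};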
Sample Use Case 1 – Algorithm
1. Call the Packet-filtering kernel to filter packets of interest
2. Call the Unique-IP-addresses kernel to obtain IP addresses
3. Build the ip_table with a parallel hashing algorithm
4. Collect traffic statistics for each protocol and each IP node (sketched below)
5. Call the Traffic-Aggregation kernel to build IP conversations
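As an illustration of step 4, a sketch of a statistics-collection kernel (the record layout and kernel name are assumptions): one thread per filtered packet updates protocol_stat[] with atomics.

#include <cuda_runtime.h>
#include <stdint.h>

// Illustrative filtered-packet record and protocol_stat[] entry.
struct Pkt       { uint8_t proto; uint32_t src_ip, dst_ip; uint16_t len; };
struct ProtoStat { unsigned long long pkts, bytes; };

// Step 4, packet-parallel: one thread per packet; concurrent updates
// to the shared counters are serialized with atomics.
__global__ void collect_proto_stats(const Pkt *pkts, ProtoStat *stat, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&stat[pkts[i].proto].pkts, 1ULL);
        atomicAdd(&stat[pkts[i].proto].bytes, (unsigned long long)pkts[i].len);
    }
}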
Sample Use Case 2
Using our GPU-accelerated library, we developed a sample use case to monitor bulk data transfer:
• Monitor bulk data transfer rate
– from aggregate of entire network down to one node
• Analyze data transfer performance bottlenecks
– Sender-limited, network-limited, or receiver-limited?
– Network-limited?
• Packet drops, or packet reordering?
Parallelism exploited: packet parallelism + flow parallelism
Sample Use Case 2 – Algorithm
Monitoring Mode:
1. Call Packet-filtering kernel to filter packets of interest
2. Call the TCP-throughput kernel to calculate data transfer rates

Analyzing Mode:
1. Call the Packet-filtering kernel to filter packets of interest
2. Classify packets into flows
3. Call the TCP-analysis kernel to analyze performance bottlenecks (sketched below)
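A sketch of what the flow-parallel analysis step might look like (the data layout and the out-of-order heuristic are assumptions, simplified to ignore sequence-number wraparound): after step 2 groups packet indexes by flow, one thread walks each flow in arrival order.

#include <cuda_runtime.h>
#include <stdint.h>

// Illustrative layout: flow_pkts[] holds packet indexes grouped by
// flow; each Flow records where its group starts and how long it is.
struct Flow   { int first_pkt; int pkt_count; };
struct TcpPkt { uint32_t seq; uint16_t len; };

// Flow parallelism: one thread walks one flow in arrival order and
// counts segments arriving behind the sequence high-water mark, a
// simplified stand-in for separating drops (retransmissions) from
// reordering. Sequence wraparound is ignored for brevity.
__global__ void tcp_analysis(const TcpPkt *pkts, const int *flow_pkts,
                             const Flow *flows, int *ooo_count, int nflows)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nflows) return;

    uint32_t max_seq = 0;
    int ooo = 0;
    for (int j = 0; j < flows[f].pkt_count; j++) {
        TcpPkt p = pkts[flow_pkts[flows[f].first_pkt + j]];
        if (p.seq < max_seq) ooo++;                      // behind the high-water mark
        if (p.seq + p.len > max_seq) max_seq = p.seq + p.len;
    }
    ooo_count[f] = ooo;
}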
Prototyped System
[Figure: prototyped system. A two-node NUMA machine: each node pairs a processor (P0/P1) and local memory with an IOH over QPI; the two 10G NICs and the M2070 GPU attach via PCI-E.]
• Our application is developed on Linux with the CUDA 5.0 programming environment.
• The packet I/O engine is implemented on Intel 82599-based 10GigE NICs.
• A two-node NUMA system:
– Two 6-core 2.67 GHz Intel Xeon X5650 processors
– Two Intel 82599-based 10GigE NICs
– One Nvidia M2070 GPU
Performance Evaluation
Packet I/O Engine
• Our packet I/O engine (GPU-I/O) achieves better performance than NETMAP and PSIOE in avoiding packet capture loss.

[Figure: packet drop rate vs. number of packets (10^3 to 10^7), one-NIC and two-NIC experiments; curves for psioe, netmap, and gpu-i/o at CPU clock rates of 1.6, 2.0, and 2.4 GHz]
Performance Evaluation
GPU-based Packet Filtering Algorithm
[Figure: packet-filtering throughput (million pkts/s) on Data-set-1 through Data-set-4, comparing single-core CPU (s-cpu) and multi-core CPU (m-cpu) at 1.6/2.0/2.4 GHz with st-gpu and mmap-gpu]
Performance Evaluation
GPU-based Sample Use Case 1
[Figure: throughput (million pkts/s) for the filters "IP", "udp", "net 129.15.30", and "net 131.225.107", comparing s-cpu and m-cpu at 1.6/2.0/2.4 GHz with st-gpu]

[Figure: packet ratio vs. packet size (50-300 bytes), one-NIC and two-NIC experiments, at CPU clock rates of 1.6, 2.0, and 2.4 GHz and with the OnDemand governor]