An Architecture for Parallelizing Network Monitoring Based on Multi-Core Processors


¹Chuan Xu, ²Weiren Shi, ³Qingyu Xiong

¹ First Author, College of Automation, Chongqing University, China, [email protected]

*² Corresponding Author, College of Automation, Chongqing University, China, [email protected]

³ College of Automation, Chongqing University, China, [email protected]

doi:10.4156/jcit.vol6.issue4.27

Abstract

Implementing effective real-time network monitoring systems is becoming increasingly difficult for a large variety of applications, including accounting, traffic identification and anomaly detection. Current solutions have to resort to custom capturing hardware, which usually comes at a high cost, while software-based capturing solutions such as libpcap cannot cope with 10Gbps link rates. In this article, we propose an architecture customized for the parallel execution of packet analysis on commodity multi-core processors, which are now widely used in personal computers. In our approach, packets with similar properties are dispatched to the same core and partitioned into several parts, which allows the threads maintained on each core to execute concurrently. Numerical results based on real campus network traffic data are presented to demonstrate the good performance and effectiveness of our system.

Keywords: Network Monitoring, Real-time, Parallel Architecture, Multi-core

1. Introduction

Currently, the constant growth of network bandwidth and the increasing complexity of network services lead to a demand for high-performance network monitoring. Accurate, real-time monitoring is considered the keystone of a wide range of network applications such as accounting, traffic identification, intrusion detection and traffic monitoring. One of the main problems of capturing and analyzing packets on high-speed links is the very short time available for handling a single packet [1]. At OC-192 (10Gbps) and higher speeds, where the packet rate can exceed 10Mpps, this is a huge challenge for current monitoring technology.

Numerous approaches have been proposed: SNMP [2], NetFlow [3], Libpcap [4] and packet sampling [5]. But these methods share a common defect: they cannot meet the need for real-time monitoring on 10-Gigabit networks. For this reason, custom hardware such as FPGAs, ASICs and network processors has been widely used to develop high-performance monitoring systems. Nicholas Weaver et al. developed an in-line, FPGA-based IPS accelerator using the NetFPGA2 platform [6]. Faisal Khan et al. proposed a hardware-software co-designed solution on a Xilinx platform [7]. The native parallelism of network processors can also be exploited to achieve real-time packet processing: [8, 9] implemented their network intrusion prevention solutions on network processors. Fei He et al. proposed a cluster-based architecture with a stateful traffic splitter for a network intrusion prevention system [10] and implemented it on a network processor. However, for sophisticated network analysis, there are great advantages in using general-purpose multi-core CPUs, which are more flexible and less expensive than custom hardware [11].

Taking full advantage of the power of multi-core processors for network monitoring requires an in-depth approach that realizes speedups for sophisticated analyses needing fine-grained coordination between concurrent threads on multiple CPUs [12, 13]. The motivation of this study is to fully exploit the power of general-purpose multi-core processors for traffic monitoring. The main contribution of this paper is a novel design and prototype implementation of an architecture based on multi-core processors for traffic analysis at 10G link speeds.

The rest of the paper is organized as follows. In section 2, we give a theoretical analysis of the potential of parallel processing on multi-core processors for network monitoring. We then sketch a high-level overview of our parallel architecture in section 3, and the prototype implementation of the system is evaluated in section 4 through experiments using real campus network data. Finally, we conclude our work and discuss future work.

2. System analysis

To date, efforts to exploit parallelism for network monitoring have focused heavily on parallelizing analysis across multiple cores. Vern Paxson et al. designed an architecture for exploiting multi-core processors to parallelize network intrusion prevention [11], but test results were not given. In [11], a custom device based on an FPGA platform serves as the front-end that dispatches copies of the packets to a set of analysis threads, which are structured as an event-based system. By associating events with packets, the system knows when the analysis of a given packet has completed. But the event mechanism is based on CPU interrupts, so a high packet rate causes a large number of interrupts, leading to excessive consumption of system resources.

Improvements in multi-core platforms significantly reduce the cost of packet processing, which makes real-time traffic monitoring on 10-Gigabit networks possible. Here we estimate the packet processing capacity of a multi-core measurement platform built on server hardware with an Intel Xeon 4-core processor, 4GB DDR3-SDRAM and 64-bit Linux 2.6.20.

Let:

1. IC denote the CPU clock cycles needed to process a packet;

2. CT denote the clock frequency (cycles per second) of each individual CPU core;

3. N denote the number of CPU cores;

4. M denote the packet processing capacity of the measurement platform.

According to Amdahl's law:

$$\mathrm{Speedup} = \frac{1}{F + (1 - F)/N} \tag{1}$$

where F is the proportion of a program that cannot be parallelized.

So the value M can be expressed by the following equation:

$$M = \frac{CT}{IC \cdot \left(F + (1 - F)/N\right)} \tag{2}$$

Supposing N = 8, CT = 3.33GHz, IC = 1000 and F ≤ 20%, Equation (2) gives M ≥ 11.1Mpps; if F decreases to 10%, M ≈ 15.7Mpps, which exceeds the maximum packet forwarding rate of OC-192 (14.88Mpps). It can be concluded from the above theoretical analysis that using commodity multi-core processors to achieve 10Gbps network monitoring is feasible.
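As a check on the estimate above, the following minimal sketch evaluates Equation (2) for the stated parameters. The function name `packet_capacity` and its argument names are illustrative and do not come from the original system.

```python
def packet_capacity(ct_hz, ic_cycles, f_serial, n_cores):
    """Packets per second from Equation (2): M = CT / (IC * (F + (1 - F) / N))."""
    return ct_hz / (ic_cycles * (f_serial + (1.0 - f_serial) / n_cores))


CT = 3.33e9  # clock frequency of each core in Hz (value used in the text)
IC = 1000    # clock cycles needed to process one packet
N = 8        # number of cores

for f in (0.20, 0.10):
    m = packet_capacity(CT, IC, f, N)
    print(f"F = {f:.2f}: M = {m / 1e6:.2f} Mpps")
```

With these inputs the function returns about 11.1 Mpps for F = 0.20 and about 15.7 Mpps for F = 0.10, so the serial fraction F has a strong effect on whether the 14.88 Mpps target is reachable.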

According to Equation (2), packets can be processed more efficiently in the following ways: 1) optimizing the parallel architecture to decrease F, 2) using faster per-packet algorithms to decrease IC, and 3) using more cores to increase N. In this paper, we focus on how to decrease F. The outline of our approach is as follows:

a. Dispatching packets to the cores equally to improve the load balance between them.

b. Running threads that share common data on the same CPU core, thus reducing the communication load between them.

c. Improving the cache data accessibility.

3. Overview of the architecture

Figure 1 illustrates our architecture. At the bottom of the diagram is the 10-Gigabit server adapter, which provides the interface to the network. The adapter uses the Direct Memory Access (DMA) mechanism to transfer data without subjecting the CPU to a heavy workload. When an upstream data transfer is completed through DMA, the adapter signals this to the CPU with an IRQ, and the Operating System (OS) then starts the standard NAPI procedures.

It is critical to make wise use of such a multi-core processor platform, and programs must be specifically designed to have a parallelizable structure. It is crucial to parallelize not only the program's execution structure but also its memory access patterns.

[Figure 1 (block diagram). Labels recoverable from the original: Network Card, Hardware Interrupt, Receive DMA, NAPI, External trigger, Queue selection (calculate key from the packet's features by hash function; key mod the number of cores; select the processor core by the remainder), Multi-core processor with CPU Core 1 and CPU Core 2, each with a Queue, an L1 Cache and cache partitions, L2 Cache (key, pointer, data partitions), Main Memory (Hash Buffer, Record Queue), Count event, Export event, export record to user space.]

Figure 1. Structure of proposed architecture for parallel execution of network monitoring

3.1. Task paralleling

Task paralleling must address the following. First, the load must be balanced among the cores. Second, the threads executing each task should stay synchronized. Third, the communication load between threads on different processor cores needs to be reduced.

To improve the load balance among the cores, we propose a packet selection algorithm, which dispatches packets with similar properties to the same processor core. The details can be found in Figure 2.
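The packet selection steps of Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' kernel implementation: the paper does not name its hash function, so CRC32 is an assumed stand-in, the protocol number is assumed to be the fifth element of the five-tuple, and `flow_key`, `select_core`, `dispatch` and the packet dictionary layout are hypothetical names.

```python
import socket
import struct
import zlib

N_CORES = 8  # hypothetical core count, matching the 8-core server used in the experiments


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash the flow features into a 32-bit key (CRC32 is an assumed stand-in)."""
    packed = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!HHB", src_port, dst_port, proto))
    return zlib.crc32(packed)


def select_core(key, n_cores=N_CORES):
    """C = key MOD N."""
    return key % n_cores


def dispatch(queues, packet):
    """Store (key, packet) in the queue of the selected core and return the core index."""
    key = flow_key(packet["src_ip"], packet["src_port"],
                   packet["dst_ip"], packet["dst_port"], packet["proto"])
    core = select_core(key, len(queues))
    queues[core].append((key, packet))
    return core


queues = [[] for _ in range(N_CORES)]
pkt = {"src_ip": "10.0.0.1", "src_port": 40000,
       "dst_ip": "192.0.2.7", "dst_port": 80, "proto": 6}
first = dispatch(queues, pkt)
second = dispatch(queues, dict(pkt))
print(first == second)  # packets of the same flow land on the same core
```

Because the key depends only on the flow features, all packets of one flow reach the same core, which keeps per-flow state local to that core. How evenly flows spread across cores depends on the hash function and on the traffic mix.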

1. Extract the five-tuple 'flow' features from the packet.

2. Calculate a hash key on the five-tuple, which includes the source IP and port and the destination IP and port.

3. Generate a variable C from the hash key according to the equation C = key MOD N, where N is the number of processor cores.

4. Select one of the processor cores using C.

5. Dispatch the packet to core C, and store the packet data and hash key in that core's L2-cache queue.

Figure 2. Packet selection algorithm

3.2. Data partition

To reduce the communication load and data sharing between threads, a data partition method is proposed that separates the packet data equally into several partitions, which suits high-speed cache access by each thread. A mutex variable is increased by 1 after each thread finishes processing its data partition, so it can be used to determine whether the processing of the packet has been completed. Each partition needs a 20-byte header (8-byte pointer to the record, 8-byte pointer to the function (x), 8-byte type) to store the information used to link it to a processing thread. More partitions therefore require more extra storage space.
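The data partition idea can be sketched as follows, under stated assumptions: one thread per partition, a lock-protected counter standing in for the "mutex variable", and completion declared once the counter equals the number of partitions. The names `partition`, `PacketJob` and `process` are hypothetical, and the per-partition header described above is not modeled.

```python
import threading


def partition(data, parts):
    """Split data into `parts` slices of nearly equal size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(parts)]


class PacketJob:
    """One packet split into partitions, with a shared completion counter."""

    def __init__(self, data, parts):
        self.parts = partition(data, parts)
        self.results = [None] * parts
        self.done = 0                 # stands in for the shared "mutex variable"
        self.lock = threading.Lock()

    def work(self, idx, func):
        self.results[idx] = func(self.parts[idx])
        with self.lock:               # increase the counter by 1 when a partition is done
            self.done += 1

    def finished(self):
        with self.lock:
            return self.done >= len(self.parts)


def process(data, parts, func):
    """Run func on every partition in its own thread and collect the results."""
    job = PacketJob(data, parts)
    threads = [threading.Thread(target=job.work, args=(i, func))
               for i in range(parts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert job.finished()
    return job.results


print(process(b"abcdefgh", 4, len))
```

Each thread writes only to its own result slot, so the counter is the only shared state that needs protection, which is the property the partitioning aims for.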

3.3. Cache

The multi-core processor platform provides designers with L1 cache, L2 cache shared between cores, and large SDRAM. Taking advantage of the shared L2 cache, the system stores the partitioned data in the multi-level cache; the details are presented in Figure 1.

When the OS starts processing packet data, it first transfers the packet to the corresponding queue in L2 cache using the packet selection algorithm. Each CPU core has its own queue, structured as a doubly linked list. A node containing packet data is inserted at the tail of the queue after being processed by the data partitioning algorithm, and nodes are removed from the head of the queue. If the queue is empty, tasks are blocked; if the queue is full, new packet data is dropped (this only happens if the packet rate exceeds the capability of the hardware). The node data is then loaded into L1 cache for processing. The mutex variable shared between the parallel threads is used to decide when processing is finished: once it is greater than the number of threads, processing is finished and the node data is deleted. Finally, the processing results are written into the record queue, which is also structured as a doubly linked list.
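The per-core queue behavior described above (insert at the tail, remove from the head, drop when full) can be sketched as follows. This is an illustration only: `collections.deque` stands in for the doubly linked list, the capacity is a hypothetical parameter, and an empty queue returns `None` here instead of blocking the task.

```python
from collections import deque


class CoreQueue:
    """Bounded FIFO queue for one core; new nodes are dropped when it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = deque()
        self.dropped = 0

    def enqueue(self, node):
        """Insert at the tail; return False and count a drop if the queue is full."""
        if len(self.nodes) >= self.capacity:
            self.dropped += 1
            return False
        self.nodes.append(node)
        return True

    def dequeue(self):
        """Remove from the head; return None if the queue is empty."""
        return self.nodes.popleft() if self.nodes else None


q = CoreQueue(capacity=2)
print([q.enqueue(n) for n in ("a", "b", "c")], q.dropped, q.dequeue())
```

Counting drops makes it possible to tell whether the packet rate has exceeded what the platform can absorb.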

A hash buffer is proposed for quickly locating records in the queue; it occupies a large amount of physical memory (allocated with the function alloc_bootmem_low_pages()). The pointer stored in the hash buffer points to the record, which can be found through the hash key.
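A hash buffer of this kind can be sketched as a preallocated table of slots indexed by the hash key. The paper does not describe collision handling or the record layout, so the chaining inside each slot and the `packets`/`bytes` record fields are assumptions made for illustration; `HashBuffer` and `count_packet` are hypothetical names.

```python
class HashBuffer:
    """Preallocated table mapping a hash key to a reference to its record."""

    def __init__(self, slots):
        self.slots = [[] for _ in range(slots)]  # allocated once, up front

    def _bucket(self, key):
        return self.slots[key % len(self.slots)]

    def lookup(self, key):
        """Return the record stored for key, or None if there is none."""
        for stored_key, record in self._bucket(key):
            if stored_key == key:
                return record
        return None

    def insert(self, key, record):
        self._bucket(key).append((key, record))


def count_packet(buf, key, length):
    """Find or create the record for key and update its counters."""
    record = buf.lookup(key)
    if record is None:
        record = {"packets": 0, "bytes": 0}  # hypothetical record fields
        buf.insert(key, record)
    record["packets"] += 1
    record["bytes"] += length
    return record


buf = HashBuffer(slots=4)
count_packet(buf, 5, 100)
count_packet(buf, 9, 50)   # 9 and 5 share a slot when there are 4 slots
count_packet(buf, 5, 200)
print(buf.lookup(5), buf.lookup(9))
```

Looking up a record costs one index computation plus a short scan of the slot, which is why a table sized generously in advance keeps lookups fast.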

4. Experiment results

The network monitoring system based on the parallel architecture has been implemented and deployed; it provides DPI packet identification and anomaly detection. The system was run on a server with two Intel Xeon 5504 2.0GHz CPUs (8 cores in total), 4GB PC-1333 RAM, 64-bit Linux 2.6.27 and an Intel-EXPX9501-10G adapter; the code was compiled with GCC v4.1.2 at the -O3 optimization level.

Figure 3. The network for experimentation.

In Figure 3, the Huawei-S9312 provides 72 1-Gigabit ports and 4 10-Gigabit ports. Each 1-Gigabit port connects a number of PC hosts. One 10-Gigabit port uplinks the router to the ISP aggregation switch; its traffic in both directions is mirrored to another 10-Gigabit port, from which our system collects data.

4.1. Influence of threads per core

In this section we analyze the influence of the number of threads per core on the packet processing cost and on the memory cost in L2-cache. Since the execution time of this processing is very short, we counted the processing cost in ticks, the unit used in Figure 4 (I).


[Figure 4 plots. (I) Packet processing cost (ticks) versus number of threads per core (1, 2, 4, 8), for 1 CPU (4 cores) and 2 CPUs (8 cores). (II) Memory cost in L2-cache (bytes) versus number of threads per core (1, 2, 4, 8), split into basic and extra memory.]

Figure 4. Influence of number of threads per core. (I) Packet processing cost. (II) Memory cost.

As shown in Figure 4 (I), the number of threads per core has a great impact on the packet processing cost, while the number of CPUs has little effect on it. As the number of threads increases, the packet processing cost decreases: by 20% from 1 to 2 threads, by 43% from 2 to 4, but by only 16% from 4 to 8. However, increasing the number of threads per core consumes more extra memory in L2-cache. In Figure 4 (II), 4 threads need only 18% extra memory, but 8 threads raise the extra memory cost to 31%. It can be concluded that four threads per core is the most appropriate choice in this scenario.

4.2. Load balance between multi-cores

The load balance among the cores (memory usage in L2-cache and CPU usage), which is affected by the packet selection algorithm described in section 3.1, is measured in this section.

[Figure 5 plots. (I) Records in L2-cache (average) for cores 1 to 8. (II) CPU usage (%) for cores 1 to 8.]

Figure 5. Load balancing. (I) Records in L2-cache. (II) CPU usage.

Figure 5 (I) shows the number of packets recorded in each core's L2-cache when packets are dispatched by the packet selection algorithm. The average value is 173 and the standard deviation is 41.58. Except for the first and fourth cores, packets are distributed relatively equally among the cores. Similarly, as shown in Figure 5 (II), the average CPU usage is 21.9% and the standard deviation is 4.02%. Despite the greater deviation on the fourth core, the other cores have uniform CPU usage.

4.3. Performance evaluation

This section reports the experimental analysis of performance, mainly CPU and memory usage. To evaluate the effectiveness of our parallel architecture, measurements were made with two sets of real traces, which are given in Table 1. In Set2, the system fully applies our parallel architecture, while in Set1 the system does not apply task paralleling.


Table 1. Data sets

Data sets   Date       Period    Num    Apply
Set1        Saturday   20 secs   4320   without parallel
Set2        Friday     20 secs   4320   with parallel

[Figure 6 plots. Six panels over daytime hour (0 to 24): (I-a), (I-b) and (I-c) for Set1, and (II-a), (II-b) and (II-c) for Set2. Axes: monitored packet rate (Mpps), CPU usage (%) and memory usage (%). Legend: Packet, CPU, Memory.]

Figure 6. The usages of CPU and memory with and without parallelization: I - Set1 and II - Set2

The distributions of CPU usage and packet rate in the two groups of real traces are shown in Figure 6 (I-a) and (II-a). CPU usage is greatly reduced when the parallel architecture is applied: the CPU usage of Set1 is more than twice that of Set2. Figure 6 (I-b) shows that, without the parallel architecture, CPU usage fluctuates and is unstable. As shown in Figure 6 (I-c) and (II-c), the memory usage of Set2 is 3% higher than that of Set1, but it remains stable.

As shown in Figure 6 (II-b), with our parallel architecture applied, the curve of the packet processing rate is completely covered by the curve of CPU usage. From this scaling, we estimate that CPU usage would reach 100% when the packet rate reaches 14.4Mpps, nearly the maximum packet forwarding rate of OC-192 (14.88Mpps). The average packet length on the Internet is currently about 450 bytes. When the processing rate of our system reaches its maximum (2.8Mpps), the corresponding link bandwidth can be calculated as 2.8Mpps × 450 bytes × 8 bits/byte ≈ 10.08Gbps, and the CPU usage is then only 25%. Although CPU usage at full load cannot be tested under current conditions, it can be inferred that the proposed parallel architecture can theoretically support real-time 10Gbps network monitoring.
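The conversion from packet rate to link bandwidth used in this estimate needs a factor of 8 bits per byte. A minimal sketch of the arithmetic, with the hypothetical helper name `link_rate_gbps`:

```python
def link_rate_gbps(packet_rate_pps, mean_packet_bytes):
    """Link bandwidth in Gbps for a packet rate and a mean packet length in bytes."""
    return packet_rate_pps * mean_packet_bytes * 8 / 1e9


rate = link_rate_gbps(2.8e6, 450)  # 2.8 Mpps at about 450 bytes per packet
print(f"{rate:.2f} Gbps")
```

For 2.8 Mpps and 450-byte packets this gives about 10.08 Gbps, which is why the observed peak rate corresponds to a fully loaded 10Gbps link even though it is far below the 14.88 Mpps worst case for minimum-size packets.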

5. Conclusion

In this paper, we present an architecture for a traffic monitoring system that fully exploits the parallel power of general-purpose multi-core processors. The contributions of our work are as follows: 1) an architecture using commodity multi-core processors is proposed for real-time monitoring on 10-Gigabit networks; 2) a packet selection algorithm is proposed that dispatches packets with similar properties to the same processor core to improve the load balance among cores; 3) a data partition method is presented that separates each packet equally into several partitions to reduce the overhead of communication and data sharing between threads.

We implemented a prototype traffic monitoring system based on our architecture on a standard server PC and evaluated its performance. The system was tested on a campus network, and the results show that it can meet the monitoring needs of a 10Gbps network.


Our future work is as follows: 1) Owing to the limitations of our test environment, the full capability of our system is not yet known. Next, commercial packet generators could be used to generate large numbers of packets of different lengths to test the capacity of our system. 2) For real-time monitoring of faster networks, such as 40-Gigabit networks, a new cache algorithm should be studied to reduce the potentially huge memory requirement without adding much processing burden.

6. Acknowledgements

This work was funded by the National Natural Science Foundation of China under Grants No. 60873079 and No. 61040044. The authors would like to thank the network center of Chongqing University for providing the real traffic traces from the Internet backbone.

7. References

[1] R. G. Clegg, M. S. Withall, and A. W. Moore, “Challenges in the capture and dissemination of measurements from high-speed networks”, IET Communications, vol. 3, no. 6, pp. 957-966, 2009.

[2] Jurgen Schonwalder and Vladislav Marinov, “On the Impact of Security Protocols on the Performance of SNMP”, IEEE Transactions on Network and Service Management, vol. 8, no. 1, pp. 52-64, 2011.

[3] C. Estan, K. Keys, D. Moore, and G. Varghese, “Building a Better NetFlow”, ACM SIGCOMM, pp. 245-256, 2004.

[4] libpcap-PFRING Homepage, “http://www.ntop.org/PF_RING.html”, February 2008.

[5] N. Duffield, C. Lund, and M. Thorup, “Flow sampling under hard resource constraints”, In Proceedings of ACM SIGMETRICS-Performance, pp. 85-96, 2004.

[6] Nicholas Weaver, Vern Paxson, and Jose M. Gonzalez, “An FPGA-based accelerator for network intrusion prevention”, In Proceedings of the ACM Symposium on Field Programmable Gate Arrays, pp. 199-206, 2007.

[7] Faisal Khan, Lihua Yuan, Chen-Nee Chuah, and Soheil Ghiasi, “A programmable architecture for scalable and real-time network traffic measurements”, In Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pp. 109-118, 2008.

[8] Xiang Wang, Yaxuan Qi, Yibo Xue, and Jun Li, “Towards High-Performance Network Intrusion Prevention System on Multi-core Network Service Processor”, In Proceedings of the 15th IEEE International Conference on Parallel and Distributed System, pp. 220-227, 2009.

[9] Fei He, Yaxuan Qi, Yibo Xue, and Jun Li, “YACA: Yet Another Cluster-based Architecture for Network Intrusion Prevention”, In Proceedings of IEEE GLOBECOM, pp. 1-5, 2010.

[10] Wen-Yew Liang, Chi-Yu Weng, Yen-Lin Chen, and Che Wun Chiou, “Design of a Parallel Face Detection Algorithm for Distributed Low Cost IP-based Surveillance Systems”, JCIT, vol. 6, no. 2, pp. 306-318, 2011.

[11] Vern Paxson, Robin Sommer, and Nicholas Weaver, “An Architecture for Exploiting Multi-Core Processors to Parallelize Network Intrusion Prevention”, In Proceedings of IEEE Sarnoff Symposium, pp. 1-7, 2007.

[12] Yongheng Chen, Wanli Zuo, Fengling He, and Kerui Chen, “Optimization Strategy of Parallel Query Processing Based on Multi-core Architecture”, JCIT, vol. 5, no. 8, pp. 21-25, 2010.

[13] Alaa M. Al-Obaidi and Sai Peck Lee, “A Concurrent Multithreaded Scheduling Model for Solving Fibonacci Series on Multicore Architecture”, IJACT, vol. 3, no. 2, pp. 24-37, 2011.