Toward line rate Traffic Classification

(1)


Niccolo' Cascarano

Politecnico di Torino

(2)

Background

• In recent years, many new traffic classification algorithms based on statistical approaches have been proposed

• One of the claims of these new algorithms is that their computational requirements are lower than those of “Deep Packet Inspection” (DPI) [3-8]

• DPI is commonly considered too expensive

Is that true?

Can DPI be further improved?

Is there anything better than DPI?

(3)


The path toward the answers

• Create a model of some classifiers (currently DPI, Naïve Bayes and SVM) and compare their complexity

– Joint work with Università di Brescia

• Improve the DPI engine itself

(4)

Question 1: is DPI so computationally complex?

(5)


What is DPI?

• DPI = pattern matching through regular expressions

• Two main flavors:

– Packet-Based per-Flow State (PBFS): network data are analyzed on a packet-by-packet basis, as soon as packets are received by the classifier

– Message-Based per-Flow State (MBFS): network data are analyzed as a single stream of data after TCP/IP normalization

• PBFS seems roughly equivalent to MBFS with respect to traffic classification [1-2]

• We use a PBFS DPI classifier, plus the capability to analyze correlated sessions (e.g., FTP and SIP)
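The difference between the two flavors can be illustrated with a minimal Python sketch. It assumes a toy HTTP signature and uses Python's `re` module as a stand-in for the real matching engine. The names `classify_pbfs`, `classify_mbfs` and `HTTP_SIGNATURE` are hypothetical, and the TCP reordering and retransmission handling that real message-based processing needs is omitted.

```python
import re

# Hypothetical toy signature: an HTTP request line at the start of the data.
HTTP_SIGNATURE = re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]")


def classify_pbfs(payloads):
    """Packet-based: test each payload on its own, as soon as it arrives."""
    for payload in payloads:
        if HTTP_SIGNATURE.search(payload):
            return "http"
    return "unknown"


def classify_mbfs(payloads):
    """Message-based: test the reassembled stream (payloads assumed in order)."""
    stream = b"".join(payloads)
    return "http" if HTTP_SIGNATURE.search(stream) else "unknown"


# A request that fits in one packet is seen by both flavors.
single = [b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"]
# A request line split across two segments is only seen after reassembly.
split = [b"GET /index.ht", b"ml HTTP/1.1\r\nHost: example.org\r\n\r\n"]
```

In this sketch, `classify_pbfs(split)` misses the signature while `classify_mbfs(split)` finds it. The packet-based flavor avoids buffering and reassembly at the price of missing signatures that straddle packet boundaries.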

(6)

Methodology

• Cost modeling

– Average cost per packet (instead of worst-case)

• Modeled each classifier

• Derived the cost of each block

• Determined the transition probability from one block to the next by analyzing real traces (with ground truth [26])

• Derived the min/max/average cost per packet
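The cost model can be sketched in a few lines of Python, assuming the average cost is the sum of each block's cost weighted by the probability that the block is executed. All block costs and probabilities below are hypothetical placeholders, not values measured in this work, and the names (`BLOCK_COST`, `average_cost`, ...) are illustrative.

```python
# Hypothetical per-block costs in CPU ticks: placeholders, not measured values.
BLOCK_COST = {
    "session_id_extraction": 100,
    "session_lookup": 200,
    "pattern_matching": 4000,
    "session_update": 300,
}

# Blocks executed for a packet of an already classified session (fast path)
# and for a packet that still needs classification (slow path).
FAST_PATH = ["session_id_extraction", "session_lookup"]
SLOW_PATH = FAST_PATH + ["pattern_matching", "session_update"]


def path_cost(blocks, cost=BLOCK_COST):
    """Cost of a packet that traverses exactly the given blocks."""
    return sum(cost[b] for b in blocks)


def average_cost(exec_probability, cost=BLOCK_COST):
    """Expected cost per packet: sum of P(block executes) * cost(block)."""
    return sum(exec_probability[b] * cost[b] for b in cost)


# Hypothetical execution probabilities, as would be estimated from a trace.
EXEC_PROBABILITY = {
    "session_id_extraction": 1.0,
    "session_lookup": 1.0,
    "pattern_matching": 0.1,
    "session_update": 0.1,
}

best = path_cost(FAST_PATH)        # every packet takes the fast path
worst = path_cost(SLOW_PATH)       # every packet takes the slow path
average = average_cost(EXEC_PROBABILITY)
```

The average always lies between the best and the worst case, and it moves toward the best case as the share of packets belonging to already classified sessions grows.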

(7)


Models

[Figure: models of the DPI and SVM classifiers]

• Session ID extraction: extracts the L3 and L4 information from network packets

• Session lookup: checks within the “session table” whether a packet belongs to an already classified session

• Pattern matching: implements the pattern matching algorithm (DPI only)

• SVM decision: implements the SVM classification algorithm (SVM only)

• Session update: updates the “session table” with the outcome of the classification

• Correlated session: analyzes the application data to obtain information on correlated sessions (DPI only)
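How the DPI blocks fit together can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the session ID is assumed to be already extracted, the correlated-session block is left out, and the signatures and all names (`SessionId`, `Classifier`, `SIGNATURES`) are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Optional


class SessionId(NamedTuple):
    """L3/L4 information identifying a session (one direction, for brevity)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


# Hypothetical toy signatures; a real DPI engine compiles them into one DFA.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD) "),
    "ssh": re.compile(rb"^SSH-"),
}


def pattern_matching(payload: bytes) -> Optional[str]:
    for protocol, signature in SIGNATURES.items():
        if signature.search(payload):
            return protocol
    return None


class Classifier:
    def __init__(self) -> None:
        self.session_table: Dict[SessionId, str] = {}
        self.slow_path_packets = 0

    def process(self, session_id: SessionId, payload: bytes) -> Optional[str]:
        # Session lookup: fast path if the session is already classified.
        known = self.session_table.get(session_id)
        if known is not None:
            return known
        # Slow path: pattern matching, then session update on success.
        self.slow_path_packets += 1
        protocol = pattern_matching(payload)
        if protocol is not None:
            self.session_table[session_id] = protocol
        return protocol
```

Once a session is in the table, every later packet of that session costs only the extraction and the lookup, which is why the share of packets on the slow path dominates the average cost.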

(8)

Basic blocks implementation

• Session ID extraction: native assembly code for IA32 generated by the NetVM framework [19]

• Session lookup and Session update: C++ code using the hash_map container of the extended STL C++ library [18]

• Pattern matching: C++ code implementing a DFA-based algorithm generated by Flex [20]. About 30 application protocols are recognized (NOTE: the cost of this block does NOT depend on the number of protocols recognized)

• SVM decision: C++ code based on the multivariate Gaussian joint density function. We generated the models for recognizing about 10 application protocols (NOTE: the cost of this block DEPENDS linearly on the number of protocols recognized)

• Correlated session: purpose-written C++ code, with the correlated-session rules for the FTP and SIP protocols derived from the NetPDL database [17]

(9)


Experimental evaluation

• Cost of each block measured with the RDTSC instruction

• Costs that depend on the input traffic (e.g. DFA) are further characterized in order to include the relevant parameters in the final formula

• Traffic traces

– UNIBS: contains a large percentage of P2P traffic, known to be challenging for DPI classifiers

– POLITO: traffic from a medium-size campus network (~6000 hosts within the network)

(10)

Absolute costs of each basic block

• The cost of pattern matching depends on the packet size

(11)


Comparison

(12)

Comparison

• Legend

– Best case: all the packets belong to already classified sessions (fast path)

– Worst case: all the packets need to take the slow path

– Average case: the costs are weighted by the execution probabilities of each basic block

• Results

– The cost of the DPI classifier is of the same order of magnitude as that of the other classifiers, even on the challenging UNIBS trace

– DPI may even be cheaper on some traces

(13)


Conclusion 1

• Packet-based DPI may not be as complex as we thought

(14)

Question 2: can we reduce DPI cost?

(15)


Yes, We Can



• … if we focus on traffic classification and not on network security

(16)

(1) Use fast algorithms

                        Min (ticks)   Avg (ticks)   Max (ticks)
Flex (canonical DFA)    76            3980          19147
PCRE (NFA-based)        35.7K         2.08M         9.16M

DFA is simple and O(payload_length)

Key question: is the DFA usable?
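Why a DFA costs O(payload_length) can be seen in a minimal table-driven matcher. The sketch below builds a toy DFA by hand for a single anchored keyword; it is not the automaton that Flex generates, and the function names are hypothetical.

```python
def build_prefix_dfa(keyword: bytes):
    """Build a DFA accepting any input that starts with `keyword`.

    State i means "the first i bytes matched"; state len(keyword) accepts.
    Each state is a dict mapping an input byte to the next state.
    """
    return [{keyword[i]: i + 1} for i in range(len(keyword))]


def dfa_match(transitions, payload: bytes) -> bool:
    """Run the DFA: one table lookup per input byte, no backtracking."""
    accepting = len(transitions)
    state = 0
    for byte in payload:
        if state == accepting:
            return True            # anchored signature already matched
        state = transitions[state].get(byte)
        if state is None:
            return False           # dead state: no match is possible
    return state == accepting


HTTP_GET = build_prefix_dfa(b"GET ")
```

Each input byte triggers exactly one lookup, so the work is bounded by the payload length regardless of how the signature is written. A backtracking matcher offers no such bound, since it may revisit the same bytes several times.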

(17)


(2) Use “friendly” regular expressions

(18)

(2) … and convert some into “friendly” ones

Baseline: not anchored + Kleene. Columns: http → unknown, unknown → http

Anchored (on UNIBS-GT): 0%

Anchored + Kleene (on UNIBS-GT): 0%

Anchored (on POLITO): 0.004%, 0.38%

Average cost on HTTP     Match (ticks)   No match (ticks)
Anchored                 1663            1415
Anchored + Kleene        5622            1367
Not anchored + Kleene    5503            3300
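The accuracy comparison against the baseline can be reproduced in spirit with a small Python sketch. The two signatures are hypothetical (the ones used in this work are not shown), “Kleene” is assumed to mean a leading `.*` closure, and Python's `re` is a backtracking engine, so the sketch shows only which payloads change label, not the DFA cost.

```python
import re

# Hypothetical signatures; the actual ones used in this work are not shown.
BASELINE = re.compile(rb".*HTTP/1\.[01] [0-9]{3}", re.DOTALL)  # not anchored + Kleene
ANCHORED = re.compile(rb"^HTTP/1\.[01] [0-9]{3}")              # must match at offset 0


def label(signature, payload: bytes) -> str:
    return "http" if signature.search(payload) else "unknown"


def lost_matches(payloads):
    """Fraction of payloads labeled http by the baseline but unknown when anchored."""
    lost = sum(
        1 for p in payloads
        if label(BASELINE, p) == "http" and label(ANCHORED, p) == "unknown"
    )
    return lost / len(payloads)


payloads = [
    b"HTTP/1.1 200 OK\r\n",            # both signatures match
    b"\x00\x17junk",                   # neither matches
    b"xxHTTP/1.0 404 Not Found\r\n",   # only the unanchored baseline matches
    b"SSH-2.0-OpenSSH\r\n",            # neither matches
]
```

On this toy input the anchored form loses one payload out of four. Most application protocols put their distinctive bytes at the start of the payload, which is consistent with the near-zero differences reported in the table.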

(19)


(3) Use a packet-based approach

          Unknown TCP traffic   Additional classified TCP traffic
POLITO    23.5GB                2.6MB

(20)

(4) Snapshot-based classification

(21)


(4) Snapshot-based classification

Fair speedup with TCP traffic

(22)

(5) Limiting classification attempts

                    Avg # pkts   Std dev
UNIBS-GT (TCP)      654          4619
POLITO-GT (TCP)     563          3659
POLITO (TCP)        68           1879
UNIBS-GT (UDP)      2.62         0.71
POLITO-GT (UDP)     6.05         26.4
POLITO (UDP)        9.17         476

                    Avg # pkts   Std dev
Bittorrent (TCP)    1            0
Samba (TCP)         1.01         0.29
HTTP (TCP)          1.05         15.6
Skype (UDP)         1.7          437
SSL(UDP)            1.92         267

(23)


(5) Limiting classification attempts

(24)

(5) Limiting classification attempts

(25)


(5) Limiting classification attempts

Possible high speedup with TCP

(26)

(4)+(5) Snapshot + Attempts limit
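Combining the two optimizations can be sketched as follows: only the first `snapshot` bytes of each payload are inspected, and a session is given up as unknown after `max_attempts` inspected packets. This is a minimal illustration under stated assumptions; the signature, the class name `LimitedClassifier` and its parameters are hypothetical.

```python
import re
from typing import Dict, Hashable, Optional

SIGNATURE = re.compile(rb"^(GET|POST|HEAD) ")   # hypothetical toy signature
UNCLASSIFIABLE = "unknown"


class LimitedClassifier:
    def __init__(self, snapshot: int = 256, max_attempts: int = 10) -> None:
        self.snapshot = snapshot          # bytes of each payload that are inspected
        self.max_attempts = max_attempts  # packets inspected before giving up
        self.labels: Dict[Hashable, str] = {}
        self.attempts: Dict[Hashable, int] = {}

    def process(self, session_id: Hashable, payload: bytes) -> Optional[str]:
        if session_id in self.labels:
            return self.labels[session_id]          # fast path
        self.attempts[session_id] = self.attempts.get(session_id, 0) + 1
        if SIGNATURE.search(payload[: self.snapshot]):
            self.labels[session_id] = "http"
        elif self.attempts[session_id] >= self.max_attempts:
            # Give up: later packets of this session take the fast path.
            self.labels[session_id] = UNCLASSIFIABLE
        return self.labels.get(session_id)
```

The snapshot bounds the cost of a single matching attempt, while the attempt limit bounds how many times that cost is paid for a session that never matches any signature.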

(27)


Conclusions 2

• DFA is OK for traffic classification

– Fast algorithms

– Up to 3 orders of magnitude faster

• “Friendly” regex

– May achieve up to a 5x speedup

• No message-based processing

• Snapshot = 256 for UDP and a fair attempt limit (e.g. 10)

– Fairly small packets; signatures that operate on packet sequences

• Strict attempt limit for TCP (N=2)

– Able to catch response packets

• A speedup of 15 over the results of Conclusion 1 gives 20 Mpps on a 3 GHz CPU
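The arithmetic behind the last bullet can be checked directly. The sketch assumes that one tick equals one CPU cycle (as counted by RDTSC) and that a single core does nothing but classification; the baseline figure it produces is derived from the three numbers on the slide, not quoted from the measurements.

```python
CPU_HZ = 3e9          # 3 GHz CPU, from the slide
TARGET_PPS = 20e6     # 20 Mpps, from the slide
SPEEDUP = 15          # from the slide

# Cycle budget available for each packet at the target rate.
ticks_per_packet_after = CPU_HZ / TARGET_PPS

# Average cost per packet implied for the unoptimized classifier.
ticks_per_packet_before = ticks_per_packet_after * SPEEDUP
```

Under these assumptions the budget is 150 ticks per packet, which implies an average cost of about 2250 ticks per packet before the optimizations.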

(28)

Addendum

• What are regex?

• We usually assume regex = regular expressions (e.g. Perl)

• We believe this model is not powerful enough to cope with modern traffic classification

• We have to think about a more extended model

– E.g. Skype and RTP are currently detected with some imperative code in addition to regex

(29)


Is there anything better than DPI?

(30)

Better perhaps no, but…

• Service-Based Traffic Classification is surely an answer

• Not exactly a replacement for DPI

• Instead, something orthogonal to (I would like to say most) traffic classification approaches

• Service-Based Classification: once you have associated (IP, port) with service S, all sessions established toward that endpoint are associated with S
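The idea can be sketched as a lookup table keyed by the server endpoint. This is a minimal illustration, not the implementation analyzed in this work; the names `ServiceTable`, `learn` and `classify` are hypothetical, and expiry of stale entries is left out.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]   # (server IP, server port)


class ServiceTable:
    def __init__(self) -> None:
        self.services: Dict[Endpoint, str] = {}

    def learn(self, endpoint: Endpoint, service: str) -> None:
        """Record the outcome of a first classification (e.g. by DPI)."""
        self.services.setdefault(endpoint, service)

    def classify(self, endpoint: Endpoint) -> Optional[str]:
        """Label a new session from its server endpoint alone, without payload."""
        return self.services.get(endpoint)
```

Because `classify` never looks at the payload, it can label sessions that DPI cannot inspect, provided some earlier session toward the same endpoint was classified correctly.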

(31)


Service-Based Traffic Classification

• No further details are provided in this presentation

• However, extensive analysis confirms that it really works

• By-product: if the first classification is correct, a lot more traffic gets classified

– E.g. a service with a few sessions in clear and most of its traffic encrypted

(32)

SBC: Services vs. sessions

[Figure: number of services and sessions over time. x-axis: Time (hours), 0 to 160; y-axis: 0 to 200000; series: Services, Sessions]

(33)


Conclusions

• The well-known limit of DPI is encrypted sessions

– No way to cope with that with DPI alone

• DPI (for traffic classification) may not be so costly compared to its competitors, and it has many advantages

– E.g. no training (regex are “simple” to derive)

– Simple implementation

– Most of the time, it walks over small portions of the DFA (which stay in cache)

• Service-Based Classification may be a good complement to the previous solutions

• My 2c: statistical traffic classifiers may be a better fit when the number of protocols is limited (e.g. if you want to identify just P2P), but they are not applicable to hundreds of protocols

(34)

Questions?

(35)


References

[1] A. Moore, K. Papagiannaki, Toward the Accurate Identification of Network Applications, 6th International Workshop on Passive and Active Network Measurement, Boston, MA, USA, pp. 41-54, May 2005.

[2] F. Risso, A. Baldini, M. Baldi, P. Monclus, O. Morandi, Lightweight, Payload-Based Traffic Classification: An Experimental Evaluation, IEEE International Conference on Communications (ICC 2008), Beijing, China, pp. 5869-5875, May 2008.

[3] J. Erman, A. Mahanti, M. Arlitt, C. Williamson, Identifying and discriminating between web and peer-to-peer traffic in the network core, Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, pp. 883-892, 2007.

[4] J. Erman, M. Arlitt, A. Mahanti, Traffic classification using clustering algorithms, Proceedings of the 2006 SIGCOMM, Pisa, Italy, pp. 281-286, 2006.

[5] L. Bernaille, R. Teixeira, I. Akodkenou, Traffic classification on the fly, 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, CA, pp. 40-49, 2008.

[6] S. Zander, T. Nguyen, G. Armitage, Self-learning IP traffic classification based on statistical flow characteristics, International Workshop on Passive and Active Network Measurement, Boston, MA, pp. 325-328, 2005.

[7] M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli, Traffic Classification through Simple Statistical Fingerprinting, ACM SIGCOMM Computer Communication Review, Vol. 37, No. 1, pp. 5-16, Jan. 2007.

[8] L. Bernaille, R. Teixeira, K. Salamatian, Early Application Identification, 2nd CoNEXT Conference, Lisboa, Portugal, Dec. 2006.

[9] A. Este, F. Gringoli, L. Salgarelli, Support Vector Machines for TCP Traffic Classification, Università degli Studi di Brescia, Technical Report 08-07, Jul. 2008.

[10] N. Williams, S. Zander, G. Armitage, A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification, SIGCOMM Computer Communication Review, Vol. 36, No. 5, pp. 7-15, Oct. 2006.

[11] H. Kim, K.C. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, Internet Traffic Classification Demystified: The Myths, Caveats and Best Practices, ACM CoNEXT, Madrid, Spain, Dec. 2008.

(36)

References

[13] T. Karagiannis, K. Papagiannaki, M. Faloutsos, BLINC: Multilevel traffic classification in the Dark, ACM SIGCOMM, Aug. 2005.

[14] A. Este, F. Gargiulo, F. Gringoli, L. Salgarelli, C. Sansone, Pattern Recognition Approaches for Classifying IP Flows, 7th International Workshop on Statistical Pattern Recognition, Orlando, FL, Dec. 2008.

[15] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.

[16] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the Support of a High-Dimensional Distribution, Neural Computation, 13, pp. 1443-1471, 2001.

[17] Computer Networks Group (NetGroup) at Politecnico di Torino, The NetBee Library, August 2004. [online] Available at http://www.nbee.org/.

[18] Hash map container reference, http://www.sgi.com/tech/stl/hash_map.html

[19] O. Morandi, F. Risso, M. Baldi, A. Baldini, Enabling flexible protocol processing through dynamic code generation, International Conference on Communications, Beijing, China, pp. 5849-5856, May 2008.

[20] flex: The Fast Lexical Analyzer, http://flex.sourceforge.net/

[21] R. Smith, C. Estan, S. Jha, S. Kong, Deflating the big bang: fast and scalable deep packet inspection with extended finite automata, ACM SIGCOMM Computer Communication Review, Vol. 38, No. 4, pp. 207-218, Oct. 2008.

[22] M. Becchi, P. Crowley, Efficient regular expression evaluation: Theory to practice, Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, CA, pp. 50-59, 2008.

[23] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to accelerate multiple regular expressions matching for deep packet inspection, ACM SIGCOMM Computer Communication Review, Vol. 36, No. 4, pp. 339-350, Oct. 2006.

[24] File Transfer Protocol (FTP), RFC 959, http://www.ietf.org/rfc/rfc959.txt

[25] N. Brownlee, Traffic flow measurement: Meter MIB, Request for Comments RFC 2064, Internet Engineering Task Force, January 1997.

[26] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, K.C. Claffy, GT: picking up the truth from the ground for Internet traffic,