Toward line rate Traffic Classification
Niccolo' Cascarano
Politecnico di Torino
Background
• In recent years, many new traffic classification algorithms based on statistical approaches have been proposed
• One of the claims of these new algorithms is that their computational requirements are lower than those of “Deep Packet Inspection” (DPI) [3-8]
• DPI is commonly considered too expensive
Is that true?
Can DPI be further improved?
Is there anything better than DPI?
The path toward the answers
• Create a model of some classifiers (currently DPI, Naïve Bayes and SVM) and compare their complexity
– Joint work with Università di Brescia
• Improve the DPI engine itself
Question 1: is DPI so computationally complex?
What is DPI?
• DPI = pattern matching through regular expressions
• Two main flavors:
– Packet-Based per-Flow State (PBFS): network data are analyzed on a packet-by-packet basis, as soon as packets are received by the classifier
– Message-Based per-Flow State (MBFS): network data are analyzed as a single stream of data after TCP/IP normalization
• PBFS seems roughly equivalent to MBFS with respect to traffic classification [1-2]
• We use a PBFS DPI classifier plus the capability to analyze correlated sessions (e.g., FTP and SIP)
Methodology
• Cost modeling
– Average cost per packet (instead of worst-case)
• Modeled each classifier
• Derived the cost of each block
• Determined the transition probability from one block to the next by analyzing real traces (with ground truth [26])
• Derived the min/max/average cost per packet
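The last step of the methodology can be sketched in a few lines of Python. This is only an illustration, not the authors' model: the block names follow the talk, but every cost and probability below is a made-up placeholder, and which blocks lie on the fast path is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    cost_ticks: float   # cost of one execution, in CPU ticks (placeholder value)
    exec_prob: float    # probability that a packet executes this block
    fast_path: bool     # assumed: True if the block also runs for classified sessions


def cost_per_packet(blocks):
    """Return the (min, max, average) cost per packet, in ticks."""
    best = sum(b.cost_ticks for b in blocks if b.fast_path)      # fast path only
    worst = sum(b.cost_ticks for b in blocks)                    # every block runs
    average = sum(b.cost_ticks * b.exec_prob for b in blocks)    # probability-weighted
    return best, worst, average


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
BLOCKS = [
    Block("session_id_extraction", 100.0, 1.0, True),
    Block("session_lookup", 200.0, 1.0, True),
    Block("pattern_matching", 4000.0, 0.25, False),
    Block("session_update", 300.0, 0.25, False),
]

best, worst, average = cost_per_packet(BLOCKS)
print(f"min={best} max={worst} avg={average}")
```

In the talk the probabilities come from real traces with ground truth; here they are fixed constants so that the weighting is easy to check by hand.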
Models
[Figure: models of the DPI and SVM classifiers]
• Session ID extraction: extracts the L3 and L4 information from network packets
• Session lookup: checks within the “session table” whether a packet belongs to an already classified session
• Pattern matching: implements the pattern matching algorithm (DPI only)
• SVM decision: implements the SVM classification algorithm (SVM only)
• Session update: updates the “session table” with the outcome of the classification
• Correlated session: analyzes the application data to obtain information on correlated sessions (DPI only)
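The way these blocks chain together for the DPI classifier might look like the Python sketch below. It is an assumption-laden illustration: packets are plain dictionaries, the session table is a dict, the HTTP signature is hypothetical, and the correlated-session block is left out.

```python
def extract_session_id(packet):
    """Session ID extraction: build a direction-independent key from L3/L4 fields."""
    a = (packet["src_ip"], packet["src_port"])
    b = (packet["dst_ip"], packet["dst_port"])
    return (packet["proto"],) + tuple(sorted([a, b]))


def match_patterns(payload):
    """Stand-in for the pattern matching block (hypothetical HTTP signature)."""
    if payload.startswith((b"GET ", b"POST ", b"HTTP/1.")):
        return "http"
    return None


def classify(packet, session_table):
    """Classify one packet, taking the fast path when the session is already known."""
    sid = extract_session_id(packet)
    label = session_table.get(sid)               # session lookup
    if label is not None:
        return label                             # fast path: no pattern matching
    label = match_patterns(packet["payload"])    # slow path
    if label is not None:
        session_table[sid] = label               # session update
    return label
```

Once the first packet of a session is classified, every later packet of that session, in either direction, costs only the ID extraction and the table lookup.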
Basic blocks implementation
• Session ID extraction: native assembly code for IA32 generated by the NetVM framework [19]
• Session lookup and session update: C++ code using the hash_map container of the extended STL C++ library [18]
• Pattern matching: C++ code implementing a DFA-based algorithm generated by Flex [20]. About 30 application protocols are recognized (NOTE: the cost of this block does NOT depend on the number of protocols recognized)
• SVM decision: C++ code written exploiting the multivariate Gaussian joint density function. We generated the models for recognizing about 10 application protocols (NOTE: the cost of this block DEPENDS linearly on the number of protocols recognized)
• Correlated session: C++ code written for this purpose, deriving the correlated session rules for the FTP and SIP protocols from the NetPDL database [17]
Experimental evaluation
• The cost of each block is measured with the RDTSC instruction
• Costs that depend on the input traffic (e.g. DFA) are further characterized in order to expose the relevant parameters in the final formula
• Traffic traces
– UNIBS: trace containing a large percentage of P2P traffic, known to be challenging for DPI classifiers
– POLITO: trace of a medium-sized campus network (~6000 hosts within the network)
Absolute costs of each basic block
• Pattern matching depends on the packet size
Comparison
• Legend
– Best case: all the packets belong to already classified sessions (fast path)
– Worst case: all the packets need to take the slow path
– Average case: the costs are weighted by the execution probabilities of each basic block
• Results
– The cost of the DPI classifier is of the same order of magnitude as that of the other classifiers, even for the challenging UNIBS trace
– DPI may be better on some traces
Conclusion 1
• Packet-based DPI may not be as complex as we thought
Question 2: can we reduce DPI cost?
Yes, We Can
• … if we focus on traffic classification and not network security
(1) Use fast algorithms
Algorithm               Min (ticks)   Avg (ticks)   Max (ticks)
Flex (canonical DFA)    76            3980          19147
PCRE (NFA-based)        35.7K         2.08M         9.16M
DFA is simple and O(payload_length)
Key question: is the DFA usable?
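Why a DFA costs O(payload_length) can be seen in a toy table-driven matcher: each payload byte triggers exactly one table lookup, with no backtracking. The Python sketch below is hand-built for a single anchored literal prefix and is only an assumption-based illustration; the automaton generated by Flex in the talk covers about 30 protocols and is far larger.

```python
DEAD = -1  # sink state: the signature can no longer match


def build_prefix_dfa(prefix):
    """Build the transition table of a DFA accepting payloads that start with `prefix`."""
    table = [{byte: state + 1} for state, byte in enumerate(prefix)]
    return table, len(prefix)   # (transitions, accepting state)


def dfa_match(table, accept, payload):
    """Walk the DFA over the payload: one table lookup per byte."""
    state = 0
    for byte in payload:
        if state == accept:
            return True                       # match decided, stop early
        state = table[state].get(byte, DEAD)
        if state == DEAD:
            return False                      # mismatch decided, stop early
    return state == accept


table, accept = build_prefix_dfa(b"SSH-")
print(dfa_match(table, accept, b"SSH-2.0-OpenSSH_5.1"))
print(dfa_match(table, accept, b"GET / HTTP/1.1"))
```

The work per byte is constant regardless of how many signatures were merged into the table, which is consistent with the earlier note that the pattern matching cost does not depend on the number of protocols recognized.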
(2) Use “friendly” regular expressions
(2) … and convert some into “friendly” ones
Baseline: not anchored + Kleene

Regex form (trace)                   http -> unknown   unknown -> http
Anchored (on UNIBS-GT)               0%                0%
Anchored + Kleene (on UNIBS-GT)      0%                0%
Anchored (on POLITO)                 0.004%            0.38%
Average cost on HTTP      Match (ticks)   No match (ticks)
Anchored                  1663            1415
Anchored + Kleene         5622            1367
Not anchored + Kleene     5503            3300
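One plausible reading of these figures is that an anchored signature can be decided within the first few bytes, while a Kleene star forces the matcher to consume the whole payload on a match. The Python sketch below illustrates that reading with a hypothetical HTTP response signature (not one of the signatures used in the talk). Note that Python's `re` module is a backtracking engine, not a DFA, so the sketch shows only how many bytes each form consumes, not tick counts.

```python
import re

# Hypothetical signatures for an HTTP response, in the three forms compared above.
ANCHORED = re.compile(rb"^HTTP/1\.[01] \d{3}")
ANCHORED_KLEENE = re.compile(rb"^HTTP/1\.[01] \d{3}.*", re.DOTALL)
UNANCHORED_KLEENE = re.compile(rb".*HTTP/1\.[01] \d{3}.*", re.DOTALL)


def consumed(pattern, data):
    """Return how many payload bytes the match spans, or None if there is no match."""
    m = pattern.match(data)
    return None if m is None else m.end()


payload = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"

print(consumed(ANCHORED, payload))            # stops right after the status code
print(consumed(ANCHORED_KLEENE, payload))     # spans the whole payload
print(consumed(UNANCHORED_KLEENE, payload))   # spans the whole payload
```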
(3) Use a packet-based approach
Trace     Unknown TCP traffic   Additional classified TCP traffic
POLITO    23.5GB                2.6MB
(4) Snapshot-based classification
Fair speedup with TCP traffic
(5) Limiting classification attempts
Trace               Avg # pkts   Std dev
UNIBS-GT (TCP)      654          4619
POLITO-GT (TCP)     563          3659
POLITO (TCP)        68           1879
UNIBS-GT (UDP)      2.62         0.71
POLITO-GT (UDP)     6.05         26.4
POLITO (UDP)        9.17         476

Protocol            Avg # pkts   Std dev
Bittorrent (TCP)    1            0
Samba (TCP)         1.01         0.29
HTTP (TCP)          1.05         15.6
Skype (UDP)         1.7          437
SSL (UDP)           1.92         267
Possible high speedup with TCP
(4)+(5) Snapshot + Attempts limit
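A minimal Python sketch of how the two optimizations could be combined per session is given below. It assumes that "snapshot" means inspecting only the first `snapshot_len` bytes of each payload and that the attempts limit caps how many packets of a session are handed to the matcher. The names `SessionState`, `classify_limited` and `matcher` are hypothetical; the defaults of 256 bytes and 10 attempts are the values the talk suggests for UDP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    label: Optional[str] = None   # classification result, once known
    attempts: int = 0             # packets already handed to the matcher


def classify_limited(state, payload, matcher, snapshot_len=256, max_attempts=10):
    """Classify a packet using a payload snapshot and a per-session attempts limit."""
    if state.label is not None:
        return state.label                 # already classified: fast path
    if state.attempts >= max_attempts:
        return None                        # give up: the session stays unknown
    state.attempts += 1
    state.label = matcher(payload[:snapshot_len])   # inspect the snapshot only
    return state.label
```

After `max_attempts` unsuccessful packets the session costs no more than a classified one, which bounds the work spent on long sessions that never match any signature.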
Conclusions 2
• DFA is OK for traffic classification
– Fast algorithms
– Up to 3 orders of magnitude faster
• “Friendly” regex
– May achieve up to a 5-times speedup
• No message-based processing
• Snapshot = 256 for UDP and a fair attempts limit (e.g. 10)
– Fairly small packets; signatures that operate on packet sequences
• Strict attempts limit for TCP (N=2)
– Able to catch response packets
• A speedup of 15 on the results in Conclusion 1 gives 20 Mpps on a 3 GHz CPU
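The arithmetic behind the last bullet can be checked with a short Python sketch. Only the 3 GHz clock, the 20 Mpps figure and the speedup of 15 come from the slide; the per-packet budget and the baseline are derived here and are not numbers stated in the talk.

```python
def ticks_per_packet(clock_hz, packets_per_second):
    """CPU ticks available for each packet at a given packet rate."""
    return clock_hz / packets_per_second


CLOCK_HZ = 3e9       # 3 GHz CPU (from the slide)
TARGET_PPS = 20e6    # 20 Mpps (from the slide)
SPEEDUP = 15         # speedup claimed over the Conclusion 1 results

budget = ticks_per_packet(CLOCK_HZ, TARGET_PPS)   # ticks per packet at 20 Mpps
implied_baseline = budget * SPEEDUP               # ticks per packet before the speedup
baseline_pps = CLOCK_HZ / implied_baseline        # packet rate before the speedup

print(budget, implied_baseline, round(baseline_pps / 1e6, 2))
```

Under these figures, 20 Mpps leaves 150 ticks per packet, which implies an average of about 2250 ticks per packet (roughly 1.33 Mpps) before the optimizations.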
Addendum
• What are regex?
• We usually assume regex = regular expressions (e.g. Perl)
• We believe this model is not powerful enough to cope with modern traffic classification
• We have to think about a more extended model
– E.g. currently Skype and RTP are detected with some imperative code in addition to regex
Is there anything better than DPI?
Better perhaps no, but…
• Service-Based Traffic Classification is surely an answer
• Not exactly a replacement for DPI
• Instead, something orthogonal to (I would like to say most) traffic classification approaches
• Service-Based Classification: once you have associated (IP, port) with service S, all sessions established toward that endpoint are associated with S
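The idea can be sketched as a lookup table keyed by the server endpoint. The Python below is a hypothetical illustration (the class and method names are not from the talk), and it ignores practical details such as entry expiry or deciding which side of a session is the server.

```python
class ServiceTable:
    """Maps a server endpoint (IP, port) to the service observed on it."""

    def __init__(self):
        self._services = {}

    def learn(self, ip, port, service):
        """Record that `service` runs on (ip, port), e.g. after a DPI match."""
        self._services[(ip, port)] = service

    def lookup(self, ip, port):
        """Return the service known for (ip, port), or None if unknown."""
        return self._services.get((ip, port))


services = ServiceTable()
services.learn("192.0.2.10", 6881, "bittorrent")   # first session classified by DPI
print(services.lookup("192.0.2.10", 6881))         # later sessions need no payload
print(services.lookup("192.0.2.10", 80))           # unknown endpoint
```

Because the lookup never touches the payload, a later session to the same endpoint can be labelled even if its content is encrypted.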
Service-Based Traffic Classification
• No further details are provided in this presentation
• However, a lot of analysis has been done that confirms that it really works
• By-product: if the first classification is correct, a lot more traffic is classified
– A service with a few sessions in the clear and mostly encrypted traffic
SBC: Services vs. sessions
[Figure: number of services and sessions over time (hours)]
Conclusions
• The well-known limit of DPI is encrypted sessions
– No way to cope with that with DPI alone
• DPI (for traffic classification) may not be so costly compared to its competitors and has many advantages
– E.g. no training (regex are “simple” to derive)
– Simple implementation
– Most of the time, it walks over small portions of the DFA (in cache)
• Service-Based Classification may be a good complement to the previous solutions
• My 2c: statistical traffic classifiers may be a better fit for a limited number of protocols (e.g. if you want to identify just P2P) but are not applicable to hundreds of protocols
Questions?
References
[1] A. Moore, K. Papagiannaki, Toward the Accurate Identification of Network Applications, 6th International Workshop on Passive and Active Network Measurement, Boston MA, USA, pp. 41-54, May 2005.
[2] F. Risso, A. Baldini, M. Baldi, P. Monclus, O. Morandi, Lightweight, Payload-Based Traffic Classification: An Experimental Evaluation, IEEE International Conference on Communications (ICC 2008), Beijing (China), pp. 5869-5875, May 2008.
[3] J. Erman, A. Mahanti, M. Arlitt, C. Williamson, Identifying and discriminating between web and peer-to-peer traffic in the network core, Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, pp. 883-892, 2007.
[4] J. Erman, M. Arlitt, A. Mahanti, Traffic classification using clustering algorithms, Proceedings of the 2006 SIGCOMM, Pisa, Italy, pp. 281-286, 2006.
[5] L. Bernaille, R. Teixeira, I. Akodkenou, Traffic classification on the fly, 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, CA, pp. 40-49, 2008.
[6] S. Zander, T. Nguyen, G. Armitage, Self-learning IP traffic classification based on statistical flow characteristics, International Workshop on Passive and Active Network Measurement, Boston MA, pp. 325-328, 2005.
[7] M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli, Traffic Classification through Simple Statistical Fingerprinting, ACM SIGCOMM Computer Communication Review, Vol. 37, No. 1, pp. 5-16, Jan. 2007.
[8] L. Bernaille, R. Teixeira, K. Salamatian, Early Application Identification, 2nd CoNEXT Conference, Lisboa, Portugal, Dec. 2006.
[9] A. Este, F. Gringoli, L. Salgarelli, Support Vector Machines for TCP Traffic Classification, Università degli Studi di Brescia, Technical Report 08-07, Jul. 2008.
[10] N. Williams, S. Zander, G. Armitage, A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification, SIGCOMM Computer Communication Review, Vol. 36, No. 5, pp. 7-15, Oct. 2006.
[11] H. Kim, Kc Claffy, M. Fomenkova, D. Barman, M. Faloutsos, Internet Traffic Classification Demystified: The Myths, Caveats and Best Practices, ACM CoNEXT, Madrid, Spain, Dec. 2008.
[13] T. Karagiannis, K. Papagiannaki, M. Faloutsos, BLINC: Multilevel traffic classification in the Dark, ACM SIGCOMM, Aug. 2005.
[14] A. Este, F. Gargiulo, F. Gringoli, L. Salgarelli, C. Sansone, Pattern Recognition Approaches for Classifying IP Flows, 7th International Workshop on Statistical Pattern Recognition, Orlando, FL, Dec. 2008.
[15] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[16] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the Support of a High-Dimensional Distribution, Neural Computation, 13, pp. 1443-1471, 2001.
[17] Computer Networks Group (NetGroup) at Politecnico di Torino, The NetBee Library, August 2004. [online] Available at http://www.nbee.org/.
[18] Hash map container reference, http://www.sgi.com/tech/stl/hash_map.html
[19] O. Morandi, F. Risso, M. Baldi, A. Baldini, Enabling flexible protocol processing through dynamic code generation, International Conference on Communications, Beijing (China), pp. 5849-5856, May 2008.
[20] flex: The Fast Lexical Analyzer, http://flex.sourceforge.net/
[21] R. Smith, C. Estan, S. Jha, S. Kong, Deflating the big bang: fast and scalable deep packet inspection with extended finite automata, ACM SIGCOMM Computer Communication Review, Vol. 38, No. 4, pp. 207-218, Oct. 2008.
[22] M. Becchi, P. Crowley, Efficient regular expression evaluation: Theory to practice, Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, California, pp. 50-59, 2008.
[23] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to accelerate multiple regular expressions matching for deep packet inspection, ACM SIGCOMM Computer Communication Review, Vol. 36, No. 4, pp. 339-350, Oct. 2006.
[24] File Transfer Protocol (FTP), RFC 959, http://www.ietf.org/rfc/rfc959.txt
[25] N. Brownlee, Traffic flow measurement: Meter MIB, Request for Comments RFC 2064, Internet Engineering Task Force, January 1997.
[26] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, K.C. Claffy, GT: picking up the truth from the ground for Internet traffic,