2.3 Related Work
2.3.2 Traffic Classification
2.3.2.3 Classification using TCP/IP Headers
Application Protocol Classification. Since deep packet inspection approaches do not work on encrypted traffic and raise privacy concerns, approaches that classify traffic using only TCP/IP headers are needed. Karagiannis et al. [2004] proposed a method for classifying traffic as either P2P or non-P2P. The method leverages a behavior common to P2P applications, namely that they use both UDP and TCP during the same time interval, to distinguish them from other applications. Karagiannis et al. [2004] also used other heuristics, such as segment size, to help classify P2P applications. Karagiannis et al. [2005] extended this work with additional heuristics that identify other kinds of traffic in addition to P2P, including HTTP, chat, and attacks. For example, Karagiannis et al. [2005] classified flows that do not send transport-layer segment payloads as attacks. Karagiannis et al. [2005] also proposed using segment transmission patterns to identify a collection of hosts that frequently communicate; such a collection of hosts is referred to as a community. The authors used these communities to identify gaming applications. Dewes et al. [2003] conducted a study focused on identifying chat traffic. Dewes et al. [2003] found that segment interarrival time, TCP connection duration, and small segment sizes were particularly informative for distinguishing chat traffic from other types of traffic.
23 Formerly known as OpenDPI.
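To make the mixed-transport heuristic concrete, the following is a minimal sketch of flagging host pairs that use both TCP and UDP within the same time interval. It is not the implementation of Karagiannis et al. [2004]: the flow record layout, the function name find_mixed_transport_pairs, the fixed 60-second window, and the sample addresses are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical flow record: (start_time_seconds, src_ip, dst_ip, protocol).
def find_mixed_transport_pairs(flows, window=60.0):
    """Return the host pairs that used both TCP and UDP in the same time window."""
    protocols_seen = defaultdict(set)
    for start, src, dst, proto in flows:
        pair = frozenset((src, dst))      # direction-independent host pair
        bucket = int(start // window)     # index of the fixed-size time window
        protocols_seen[(pair, bucket)].add(proto)
    # Keep only pairs that used both transports inside a single window.
    return {
        pair
        for (pair, _bucket), protos in protocols_seen.items()
        if {"TCP", "UDP"} <= protos
    }

# Made-up example flows (illustrative values only).
flows = [
    (0.0, "10.0.0.1", "10.0.0.2", "TCP"),
    (12.5, "10.0.0.2", "10.0.0.1", "UDP"),   # same pair, same window: flagged
    (3.0, "10.0.0.1", "10.0.0.3", "TCP"),
    (500.0, "10.0.0.3", "10.0.0.1", "UDP"),  # same pair, different window: not flagged
]
candidates = find_mixed_transport_pairs(flows)
print(candidates)
```

Fixed windows keep the sketch short; two flows that straddle a window boundary are missed, so a sliding window may be preferable in practice.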
These prior works used a wide variety of features to classify applications. However, these methods rely on heuristics to perform the classification. A number of techniques with a stronger theoretical foundation can also be used for traffic classification. Crotti et al. [2007] proposed classifying traffic by building a probability distribution of the traffic generated by each application and determining whether unknown traffic matches a stored distribution within a predefined threshold. Like prior approaches, Crotti et al. [2007] classified traffic by application-layer protocol, including POP3 and HTTP. Moore and Zuev [2005] used the Naive Bayes classifier, a probabilistic machine learning approach, for traffic classification. Roughan et al. [2004] also proposed using machine learning approaches, particularly Linear Discriminant Analysis and K-Nearest Neighbors, for traffic classification. The study by Roughan et al. [2004] differs from most prior classification studies in that it classified traffic for traffic engineering purposes. In traffic engineering, ISPs strategically allocate network resources to traffic with different performance or Quality of Service (QoS) requirements. Thus, the targeted categories of traffic were not based entirely on application protocols. Some of the classes that Roughan et al. [2004] targeted were bulk (large file transfers), interactive (real-time applications that require user input, such as a remote login), streaming (real-time applications such as video), and transactional (applications that transfer a small number of requests). Schatzmann et al. [2010] used Support Vector Machines (SVM) to classify traffic as either mail or non-mail. Schatzmann et al. [2010] found that some temporal features (e.g., the duration of a TCP connection and TCP connection interarrival times) can be reliably used to differentiate mail from non-mail traffic.
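As an illustration of the supervised approaches above, the following is a minimal K-Nearest Neighbors sketch that assigns a flow to one of the traffic engineering classes named by Roughan et al. [2004]. It is not their implementation: the two features (mean segment size in bytes and flow duration in seconds), the function name knn_classify, and all training values are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training flows closest to query.

    train is a list of (feature_vector, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (assumed values): (mean segment size, flow duration).
train = [
    ((1400.0, 120.0), "bulk"),
    ((1350.0, 95.0), "bulk"),
    ((1450.0, 200.0), "bulk"),
    ((80.0, 600.0), "interactive"),
    ((120.0, 450.0), "interactive"),
    ((60.0, 900.0), "interactive"),
]

print(knn_classify(train, (1300.0, 150.0)))  # large segments, short flow
print(knn_classify(train, (100.0, 500.0)))   # small segments, long flow
```

The features here are left unscaled for brevity; because Euclidean distance is sensitive to feature scale, a real classifier would normally normalize the features first.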
Kim et al. [2008] compared the performance of the many different methods used in the traffic classification literature. Kim et al. [2008] found that machine learning approaches outperform heuristic methods [Karagiannis et al., 2005] and port-based approaches [Moore et al., 2001, Touch et al., 2013]. Kim et al. [2008] also highlighted that machine learning methods require a large amount of training data in order to achieve high performance. Lim et al. [2010] investigated which machine learning approaches performed best. Lim et al. [2010] concluded that machine learning methods can achieve similar performance if appropriate, and sometimes essential, preprocessing and data transformation methods are applied to the traffic features. However, Lim et al. [2010] also found that some techniques, such as classification trees and K-Nearest Neighbors, performed well without any additional preprocessing. We note that the authors of this study did not tune each machine learning method during their evaluation. Thus, it is unclear whether other methods, such as Support Vector Machines or Naive Bayes, can also perform well without preprocessing.
The machine learning methods discussed so far are supervised machine learning techniques. Recall that supervised machine learning techniques require labeled training data. There are also traffic classification methods that use unsupervised machine learning techniques, which do not require labeled data. Unsupervised machine learning techniques are commonly referred to as clustering algorithms. While clustering algorithms do not require labeled training data, additional effort, most likely with the assistance of a domain expert, is needed to label and interpret the clusters they output. McGregor et al. [2004] investigated the feasibility of using the expectation maximization (EM) algorithm (a clustering method) for traffic classification. McGregor et al. [2004] found that the resulting clusters each included many different types of applications, whereas ideally different applications should appear in different clusters. Erman et al. [2006] also investigated whether clustering algorithms can be used for traffic classification. Erman et al. [2006] evaluated the DBSCAN, K-means, and AutoClass clustering approaches and found that approximately 150 clusters were needed for the majority of each cluster to consist of a single type of traffic. This is a large number of clusters considering that Erman et al. [2006] were trying to distinguish fewer than 10 classes of traffic. This result implies that traffic behavior is diverse and that a single cluster cannot be used to classify an application. Hernández-Campos et al. [2003b] used a different class of clustering approach, hierarchical clustering, for distinguishing traffic. Hierarchical clustering groups instances in a hierarchy and is more robust than traditional clustering methods at clustering diverse datasets. Hernández-Campos et al. [2003b] found that the results from hierarchical clustering were able to distinguish applications such as P2P and Web.
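The cluster labeling step discussed above can be sketched as follows: each cluster is labeled with its most common traffic class, and the fraction of flows that agree with their cluster's label (often called purity) indicates how well the clusters separate the classes. This is a minimal sketch under stated assumptions, not the evaluation code of any cited study; the function name label_clusters, the cluster assignments, and the class labels are made up, and the ground-truth labels are assumed to come from a domain expert or a labeled trace.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, true_labels):
    """Label each cluster with its majority class and return (labels, purity)."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    # Majority class per cluster.
    majority = {
        cid: Counter(labels).most_common(1)[0][0]
        for cid, labels in members.items()
    }
    # Purity: fraction of flows whose class matches their cluster's label.
    correct = sum(
        majority[cid] == label for cid, label in zip(cluster_ids, true_labels)
    )
    return majority, correct / len(true_labels)

# Made-up output of some clustering algorithm over eight flows.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["web", "web", "p2p", "p2p", "p2p", "mail", "mail", "web"]

majority, purity = label_clusters(cluster_ids, true_labels)
print(majority, purity)
```

Increasing the number of clusters generally raises purity, which is consistent with the observation that many more clusters than classes were needed before each cluster was dominated by a single type of traffic.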
This body of work, which is included in Table 2.5, is related to this dissertation because it consists of the state-of-the-art traffic classification methods that use TCP/IP headers.25 However, the related work that we discussed focuses primarily on the problem of application classification. The work presented in Chapter 4 uses similar techniques, particularly learning-based classification methods, to advance the state of the art in traffic classification by focusing on the problem of web page classification.
Anomaly detection is an area of traffic analysis that is related to traffic classification. The goal of anomaly detection is to identify instances in network traffic that deviate from “normal” behavior. Thus, anomaly detection problems at their simplest level consider two classes of traffic: normal and abnormal (anomalous). Most anomaly detection methods in the literature are not focused on understanding and classifying web applications or other application-layer protocols. These methods instead focus on network-layer information and on identifying security or network management issues (e.g., port scanning, flash crowds, attacks, routing errors, etc.) [Barford et al., 2002, Lakhina et al., 2005, Soule et al., 2005, Brauckhoff et al., 2006, Ringberg et al., 2007, John and Tafvelin, 2007, Nychis et al., 2008, Milling et al., 2012, Silveira et al., 2010, Yan et al., 2012].