Analyzing Network-Wide Interactions Using Graphs: Techniques and Applications



Electronic Theses and Dissertations UC Riverside

Peer Reviewed

Title:

Analyzing Network-Wide Interactions Using Graphs: Techniques and Applications

Author:

Iliofotou, Marios

Acceptance Date: 2011

Series:

UC Riverside Electronic Theses and Dissertations

Degree:

Ph.D., Computer Science, UC Riverside

Advisor(s):

Faloutsos, Michalis

Committee:

Krishnamurthy, Srikanth, Molle, Mart

Permalink:

http://escholarship.org/uc/item/1j5050q7

Abstract:

Copyright Information:

All rights reserved unless otherwise indicated. Contact the author or original publisher for any necessary permissions. eScholarship is not the copyright owner for deposited works. Learn more at http://www.escholarship.org/help_copyright.html#reuse


UNIVERSITY OF CALIFORNIA RIVERSIDE

Analyzing Network-Wide Interactions Using Graphs: Techniques and Applications

A Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy in Computer Science

by

Marios Iliofotou

March 2011

Dissertation Committee:

Dr. Michalis Faloutsos, Chairperson

Dr. Srikanth Krishnamurthy


Copyright by Marios Iliofotou


The Dissertation of Marios Iliofotou is approved by:

Committee Chairperson


ACKNOWLEDGMENTS

I would like to express my gratitude to all those who made this dissertation possible. Primarily, I thank my advisor, Professor Michalis Faloutsos, for his dedication to this work and for so strongly believing in me. I would also like to express my gratitude to my collaborators: Professor George Varghese from UC San Diego, Professor Michael Mitzenmacher from Harvard University, Professor Tina Eliassi-Rad from Rutgers University, Brian Gallagher (Lawrence Livermore Labs), Prashanth Pappu (Conviva), and Sumeet Singh (Cisco Systems), for their invaluable guidance, suggestions, and advice.

Finally, I would also like to thank Flavio Bonomi and Cisco Systems, Inc., for their support throughout this work. Support for this work was provided by several Cisco URP grants, and by the NSF grants NETS-0721889 and NECO-0832069. Parts of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-CONF-461043. Special thanks to Thomas Karagiannis (Microsoft Research Cambridge), Kenjiro Cho (WIDE), Ram Keralapura (Narus), and Antonio Nucci (Narus) for sharing their code and datasets. The SDSC’s TeraGrid and compute resources are supported by the NSF grant CONMI CRI-0551542. Support for CAIDA’s Internet traces is provided by the National Science Foundation, the US Department of Homeland Security, and CAIDA Members.


ABSTRACT OF THE DISSERTATION

Analyzing Network-Wide Interactions Using Graphs: Techniques and Applications

by

Marios Iliofotou

Doctor of Philosophy, Graduate Program in Computer Science

University of California, Riverside, March 2011

Dr. Michalis Faloutsos, Chairperson

The fundamental problem that motivates this dissertation is the need for better methods and tools to manage and protect large IP networks. In such networks, it is essential for administrators to profile the traffic generated by different applications (e.g., Web, BitTorrent, FTP) and be able to identify the packets of an application in the wild. This enables administrators to effectively accomplish the following key tasks: (a) Manage the network: It allows different policies to be applied to different applications, e.g., rate limiting peer-to-peer (P2P) traffic during busy hours. (b) Protect the network: Profiling malicious traffic requires a strong separation from benign traffic; therefore, knowing the behavior of benign applications provides better separation from malicious activity. Despite some significant efforts to solve the traffic profiling problem, none of the existing methods addresses all relevant problems. The difficulty of the problem comes from the following three factors: (a) the intentions of application writers and users to hide their traffic using obfuscation (e.g., payload encryption); (b) the limited information about flows and IP-hosts when traffic is monitored at the Internet backbone; and (c) the continuous appearance of new applications as well as undocumented changes to existing network protocols.


focuses on the network-wide interactions of IP-hosts (as seen at a router). To facilitate the analysis of network-wide interactions, we represent traffic as a graph, where each node is an IP address, and each edge represents a type of interaction between two nodes. We use the term Traffic Dispersion Graph or TDG to refer to such a graph. Intuitively, TDGs capture the social behavior of network hosts, which, as we show here, is hard to obfuscate. For example, a P2P protocol cannot function while trying to hide its overlay network, as maintaining a network overlay is a fundamental behavior of a P2P protocol. This dissertation focuses on three key aspects of network-wide interactions: (a) the graph shapes and structures formed by different applications; (b) the distinctive dynamic network-wide behavior of network applications (i.e., how the graphs change over time); and (c) the identification of communities formed by IP-hosts over the Internet. Using the traffic analysis techniques we propose here, we develop novel traffic profiling solutions that are robust to obfuscation and can operate at the backbone, two requirements that are both very challenging to meet with the current state of the art. To evaluate the effectiveness of our methods, we use real-world traffic traces collected from six different networks. This dissertation presents the first work to explore the full capabilities of TDGs for profiling and analyzing traffic. Based on our results, we believe that TDGs can provide the basis for the next generation of traffic monitoring tools.


List of Figures

1.1 Basic functionalities of a network monitoring system. The traffic analyzer extracts useful information about the network that can be used to: (a) manage the network by controlling or prioritizing the traffic from specific applications, (b) provide visualizations that summarize the status of the network, and (c) generate summaries of the traffic that can be logged for future reference. The goal of this dissertation is to provide better methodologies to analyze network traffic.

2.1 A TDG representing a five-minute traffic snapshot of the FTP application. The figure shows that TDGs can have multiple connected components. For visualization purposes the largest connected component is located in the center of the figure.

4.1 Two TDG visualization contrasting a P2P (left) with a client-server application (right). Largest component is with bold (black) edges - hence the graphs are best viewed on a computer screen or a colored print-out.

4.2 Graphical representation of the BitTorrent TDG with 3,000 IPs, showing the formation of long paths connecting P2P hosts. The largest connected component is highlighted with darker color edges.

4.3 The average degree for various P2P protocols over one month in trace TR-ABIL. Candle sticks show the maximum and minimum recorded values together with the average (horizontal line) and ± the standard deviation. For visualization purposes, for the MP2P application we show only the minimum value which is very high (23.73).

4.4 The effect of changing the interval of observation for TDGs ranging from five seconds up to 15 minutes over a large set of protocols. To reduce variability we can choose to use longer intervals. After 300 seconds the reduction in variability is not significant.

4.5 Evaluating K-means: With sufficiently large k (> 120) K-means can efficiently separate the flows of different application. All classification metrics are in terms of flows.

4.6 Graption-P2P achieves > 90% flow F-Measure over a large range of similarity thresholds

4.7 Graption-P2P compared to cluster labeling based on ground truth. Results are also compared with and without cluster merging.

4.8 The percentage of P2P flows detected by Graption-P2P and BLINC. Flow precision for Graption-P2P is 95% and for BLINC is 89%.

5.1 The “edge volatility” of five different applications over time for the PAIX trace (see Table 5.1). All P2P TDGs change more over time compared to the SMTP and DNS TDGs.

5.2 Collaborative vs Non-Collaborative applications: graph classification performance using static metrics. For each experiment we executed 50 runs with different random training samples.

5.3 P2P vs Rest: Graph classification performance using static metrics. We need a larger training size in order to achieve good results compared to the Collaborative vs Non-Collaborative classification.

5.4 Our three dynamic heuristics for the KSCY trace. The applications separated at each heuristic are reported first (leftmost side). Notations: Game (GM), Gnutella (GNU), Fast-Track (FT), WinMX (WIN), eDonkey (EDO), MP2P (MP).

5.5 Final TDG-based classification process combining static (unary) and dynamic (binary) metrics. From top to bottom, at each level of the decision tree, the classifier selects the easier to identify application.

5.6 The changes in average degree (bottom) and the number of edges (top) of the Web TDG after we inject Gnutella traffic. Plots show both the values of the metrics under no pollution (normal) and after pollution (polluted).

5.7 Successful detections over different pollution intensities. For small intensities, our dynamic metrics result in more detections compared to the static. The combination of dynamic and static metrics gives better results in the majority of our experiments.

5.8 Detection rate per pollution intensity for three different backbone links.

6.1 Applications form communities that are visible from the Internet backbone. In the graph visualization, nodes are IP addresses and edges represent flows between IPs. In this work, we take advantage of these communities in order to profile all the flows (edges in the graph) in a trace. For visualization purposes, only traffic from three applications is visible. To highlight the three communities we color the nodes based on their dominant application (better viewed in a color print-out or a monitor screen). The data is from a peering link of a large ISP in the US.

6.2 Overview of PBA framework for traffic profiling.

6.3 Performance for various community discovery algorithms on trace graphs. The left plot shows the accuracy of PBA’s CLUS algorithm. The right plot shows execution time for the WIDE trace. Execution times for other traces were similar.

6.4 Classification results for various PBA algorithms using 1% seed size.

6.5 Effect of seed size for the MFN trace. The left plot shows overall accuracy. The right plot depicts standard error over various seed sizes (across 20 trials).

6.6 Effect of random errors in the seeds for the BRAZ trace. Other traces are qualitatively similar. The left plot shows results using 1% seed size; the right plot depicts results using 10% seed size.

6.7 Seeding using seeds from a host-based and port-based classifier. The seeds are from the combination of BLINC and CoralReef (called ENSEMBLE).

6.8 Effect of connectivity obfuscation for the BRAZ trace. This is the most challenging trace since it has the highest number of hosts with multiple applications.

6.9 Obfuscation by blending for the BRAZ trace. P2P traffic is trying to tunnel its traffic using


Chapter 1


Monitoring and analyzing network traffic are essential for protecting and managing large IP networks. They enable network administrators to gain an in-depth understanding of the behavior of the traffic and help them answer critical questions about their network, such as: “Who is using my network and for what?”, “Which applications are running on my network?”, “Half of my traffic is encrypted; is that traffic regular Web traffic or something else (and potentially malicious)?” The basic components and functionalities of a traffic monitoring system are shown in Figure 1.1. The traffic analyzer extracts useful information regarding the network that can be used to: (a) manage and protect the network by controlling the traffic from different applications, (b) provide visual summaries of the network status to human operators, and (c) generate summaries of the network that can be stored for future reference. The focus of this dissertation is on task (a), which faces some key challenges, as we discuss next in more detail. Nevertheless, some of the techniques and visualizations we introduce may find uses in the two other tasks as well. Ultimately, the main goal of this dissertation is to provide novel insights about the traffic as well as introduce new methodologies to protect and manage large IP networks. We focus our study on large backbone networks, which are harder to protect and manage than smaller networks.

Profiling network traffic is the fundamental first step in protecting and managing the network. It allows administrators to identify the traffic generated by different applications, such as Web, Email, Video, and peer-to-peer (P2P) file sharing. Traffic profiling is useful for two key reasons: (a) Managing the network: It enables different policies to be applied to different applications, e.g., rate limiting P2P traffic during peak hours. We see this in Figure 1.1, where analyzed information can be used as feedback to trigger different actions by the network. In addition, the traffic profiler pinpoints the applications contributing most of the traffic, which is critical for network planning. For example, when Web traffic dominates, deploying a Web cache is a natural choice for reducing the outgoing traffic volume. (b) Protecting the network: Profiling malicious traffic requires a strong separation from benign traffic; therefore, knowing how the “good” traffic behaves makes it easier to separate it from malicious activity. Overall, it is hard to protect what you cannot see: if large portions of the traffic are unknown, there is more room for traffic from malicious or unwanted applications to dominate.

Figure 1.1: Basic functionalities of a network monitoring system. The traffic analyzer extracts useful information about the network that can be used to: (a) manage the network by controlling or prioritizing the traffic from specific applications, (b) provide visualizations that summarize the status of the network, and (c) generate summaries of the traffic that can be logged for future reference. The goal of this dissertation is to provide better methodologies to analyze network traffic.

Despite some significant efforts to solve the traffic profiling problem, none of the existing methods addresses all relevant problems. The difficulty of the problem comes from (a) the intentions of application writers and users to obfuscate and hide their traffic, (b) the limited information about flows and IP-hosts at the backbone, and (c) the continuous appearance of new network applications as well as undocumented changes to existing protocols. In fact, we observed in our study that using standard deep packet inspection (DPI) leaves up to 60% of the traffic unknown in some networks. Due to the enforcement of traffic shaping policies by Internet Service Providers (ISPs) [2], the most challenging application to identify is P2P. Detecting P2P traffic is a potentially important problem for ISPs that want to manage such traffic, and for specific groups such as the entertainment industry in legal and copyright disputes. Detecting P2P traffic is also of particular interest since it represents a large portion of Internet traffic [3]. In fact, a recent work by Labovitz et al. [3] shows that at least 20% of all inter-domain traffic on the Internet is P2P.


Even though, percentage-wise, P2P traffic is not the dominant contributor to Internet traffic today, its overall traffic volume is constantly increasing. In addition, other studies using advanced deep packet inspection show P2P traffic to be up to 70% of the overall volume in some networks [4]. One of the key goals of this dissertation is to provide new techniques that can effectively identify P2P traffic even when the protocols employ obfuscation and undergo continuous changes.

Most current traffic profiling methods can be naturally categorized according to their level of observation: packet-level methods [5, 6], flow-level statistical approaches [7, 8], or host-level methods [1, 9]. Each existing approach has its own pros and cons, and no single method clearly emerges as a winner. Relevant challenges that need to be considered include: (a) detecting applications even when they intentionally hide their traffic (we address this in Chapter 6); (b) operating at backbone links [10, 1], where we have partial information about the flows and the hosts (all our proposed methods address this challenge); and (c) identifying applications that are new, and thus without a known profile (we address this in Chapter 4). As we show in Chapters 3 and 6, packet- and flow-level methods are easy to evade. Moreover, these methods require per-application training and will thus not detect traffic from emerging protocols. Behavioral host-based approaches such as BLINC [1] can detect traffic from new protocols [1], but have weak performance when applied at the backbone [10]. In addition, most tools, including BLINC [1], require fine-tuning and careful selection of parameters [10]. We discuss the limitations of previous methods in more detail in Section 6.3.

In this dissertation, we analyze the network-wide interactions among IP-hosts, which is a new approach to extracting information about network traffic. To facilitate the analysis of network-wide interactions, we represent traffic as a graph, where each node is an IP address, and each edge represents a type of interaction between two nodes. We use the term Traffic Dispersion Graph or TDG to refer to such a graph [11]. For example, in Chapter 4 we show that TDGs enable the detection of behavior (e.g., highly connected graphs) that is common among P2P applications and different from other traffic (e.g., Web). While we recognize that some previous efforts [12, 13] have used graphs to detect worm activity, they have not explored the full capabilities of TDGs for profiling and analyzing traffic to the extent we do in this dissertation. In fact, the work here is the first to reveal: (a) the graph shapes and structures formed by different applications (Chapter 4); (b) the distinctive dynamic network-wide behavior of network applications (Chapter 5); and (c) the formation of communities by IP-hosts over the Internet (Chapter 6).

We divide our study of network-wide interactions using TDGs into three main chapters. In each chapter we explore different aspects of TDGs and reveal key insights into the network-wide behavior of applications. In addition, in each chapter we present a systematic way of using information extracted from TDGs to build effective tools to facilitate the protection and management of real-world networks. All our methods are evaluated using a diverse set of real-life network traffic traces collected from six different backbone locations (see Chapter 2 for more details). Next we describe each of the main chapters and highlight their key contributions:

• Chapter 4: Identifying structural differences in the TDGs of different applications. This is the first study to show that the network-wide interactions of different application classes (e.g., P2P versus client-server) result in graphs with very different structural shapes and properties. Perhaps surprisingly, we observe these network-wide behaviors even when monitoring the network from a single backbone location. In addition, the chapter covers a large-scale study of TDGs from different applications and shows that the basic properties of the graphs remain fairly stable over space (different backbone locations) and time (different dates and times). As a direct application of this analysis, we introduce Graption, a systematic framework for utilizing these differences to identify P2P traffic at the network core. We apply our framework to real-world traffic and show that it identifies 90% of all P2P flows with 95% accuracy, which is particularly challenging for other methods.


• Chapter 5: Exploiting the dynamicity in the analysis of network-wide interactions. In this chapter, we present the first study of network-wide dynamic behaviors that we are aware of. To this end, we introduce a novel methodology to analyze network-wide interactions using a sequence of graph snapshots taken over different time periods. This enables us to capture key dynamic properties of applications that cannot be revealed using a “static” graph representation. To highlight the advantages of our analysis, we use our methodologies to improve the Graption framework to support fine-grained separation between P2P file-sharing and collaborative applications (e.g., DNS). In addition, we provide a new method of detecting changes to the profiles of legacy applications (e.g., SMTP) caused by polymorphic blending attacks. To the best of our knowledge, this work is the first to introduce the problem of polymorphic blending in traffic profiling.

• Chapter 6: Revealing the formation of communities by different applications. The key question addressed here is: “Is it possible to profile traffic at the backbone without relying on its packet- and flow-level information, which can be obfuscated?” In this chapter, we “zoom out” from studying a single application at a time and look at the global TDG formed by the interactions of all applications. This allows us to study how TDGs of different applications are interrelated. The key insight is that IP-hosts tend to communicate more frequently with hosts involved in the same application, forming communities (or clusters). We propose a novel approach, called Profiling-By-Association (PBA), that uses only the global TDG and information about some applications used by a few IP-hosts (a.k.a. seeds). Profiling a few members within a cluster can “give away” the whole community. Following our approach, we develop different algorithms to profile Internet traffic and evaluate them on real traces from four large backbone networks. We show that PBA’s accuracy is on average around 90% with knowledge of only 1% of all the hosts. Our PBA framework provides a valuable tool for network administrators to detect obfuscated traffic, which would otherwise be very hard to identify using current state-of-the-art classifiers.
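The intuition behind profiling by association can be illustrated with a small sketch: starting from a few labeled seed hosts, unlabeled hosts repeatedly adopt the most common label among their already-labeled neighbors in the global TDG. This is a generic majority-vote label propagation written for illustration only; it is not one of the PBA algorithms developed in Chapter 6, and the function name `propagate_labels`, the data layout, and the toy graph are all assumptions.

```python
from collections import Counter, defaultdict


def propagate_labels(edges, seeds, max_rounds=10):
    """Spread application labels from seed hosts to their neighbors.

    edges: iterable of (ip_a, ip_b) pairs from the global TDG (treated as undirected).
    seeds: dict mapping a few IPs to a known application label.
    Returns a dict with a label for every host reached from a seed.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = dict(seeds)
    for _ in range(max_rounds):
        updates = {}
        for host in neighbors:
            if host in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[host] if n in labels)
            if votes:
                # Ties are broken alphabetically to keep the result deterministic.
                best = max(votes.values())
                updates[host] = min(label for label, count in votes.items() if count == best)
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy graph (hypothetical hosts): one P2P-like community and one Web server with clients.
edges = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "p4"),
    ("w1", "srv"), ("w2", "srv"), ("w3", "srv"),
]
seeds = {"p1": "P2P", "srv": "Web"}
labels = propagate_labels(edges, seeds)
print(labels)
```

In this toy example two seeds are enough to label all eight hosts, because each community is connected internally and the two communities share no edges. Real traces contain hosts that run several applications, which is one reason the sketch should not be read as a substitute for the algorithms evaluated in Chapter 6.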

The rest of this dissertation is organized as follows. In Chapter 2 we provide the basic background on TDGs and on the problem of traffic classification/profiling, and describe the datasets used in our study. Chapter 3 covers the most important related work in the fields of network-wide interactions, traffic profiling, and security. Our study of the structural properties of TDGs from different applications is the main topic of Chapter 4. Chapter 5 includes our study of dynamic behaviors and the identification of polymorphic blending attacks. In Chapter 6 we study the formation of communities in the global graph of all interactions and show how we can use it to design a traffic profiling tool that is robust to obfuscation. The main conclusions and the key takeaways of this dissertation are in Chapter 7.


Chapter 2 Background


This chapter covers the following three topics: (a) It provides a basic background on TDGs. (b) It offers a detailed description of the datasets used throughout this dissertation. (c) It provides the basic terminology and definitions covering the area of traffic profiling/classification. The basic definitions of TDGs and of many graph metrics can also be found in [11], which is not a contribution of this dissertation. For completeness, however, the content is also provided here. Readers may wish to skip the details of this chapter and return to it as a reference when needed.

2.1 Traffic Dispersion Graphs (TDGs)

A TDG refers to a graph G(V, E) that represents the network-wide interaction (say “who talks to whom”) from a data trace. In a TDG, each node corresponds to a distinct IP address, and an edge signals an interaction between a pair of nodes. The power of TDGs lies in the flexibility of deciding what constitutes an interaction, which can be implemented in practice by what we refer to as an edge filter. An interaction could correspond to a simple packet exchange, or could be determined by a complex rule, such as “at least three TCP packets at port 25 were exchanged.” In their more general form, TDGs can be directed and weighted, as we discuss below.
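To make the notion of an edge filter concrete, the following minimal Python sketch implements the example rule quoted above (at least three TCP packets exchanged on port 25). The `Packet` record, its field names, and the function `port25_filter` are hypothetical and chosen for illustration; they are not part of any tool described in this dissertation.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple


class Packet(NamedTuple):
    """Hypothetical packet header record (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # e.g. "TCP" or "UDP"


def port25_filter(packets: Iterable[Packet], min_packets: int = 3) -> Set[Tuple[str, str]]:
    """Return the IP pairs that exchanged at least `min_packets` TCP packets on port 25.

    Pairs are unordered here, so packets in both directions count toward
    the same interaction.
    """
    counts = Counter()
    for p in packets:
        if p.protocol == "TCP" and 25 in (p.src_port, p.dst_port):
            pair = tuple(sorted((p.src_ip, p.dst_ip)))
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_packets}


packets = [
    Packet("10.0.0.1", "10.0.0.2", 40000, 25, "TCP"),
    Packet("10.0.0.2", "10.0.0.1", 25, 40000, "TCP"),
    Packet("10.0.0.1", "10.0.0.2", 40000, 25, "TCP"),
    Packet("10.0.0.3", "10.0.0.2", 40001, 25, "TCP"),  # only one packet: filtered out
    Packet("10.0.0.1", "10.0.0.4", 40002, 53, "UDP"),  # not TCP on port 25
]
edges = port25_filter(packets)
print(edges)
```

Only the pair (10.0.0.1, 10.0.0.2) passes the filter in this example. Treating the pair as unordered is a simplification made for the sketch; a directed variant would instead key on the initiator and the recipient.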

A TDG represents a static traffic snapshot. Given an edge filter, we can create a TDG

that represents all the edges that matched the filter during a fixed interval of observation, which we denote by T . Given a group of flows S, collected over a time interval T , we define the corresponding TDG to be a directed graph G(V, E), where the set of nodes V corresponds to the set of IP addresses in S, and there is a link (u, v) ∈ E from u to v if there is a flow f ∈ S between them. Throughout this dissertation, we assume that pack-ets can be grouped into flows using the standard 5-tuple {srcIP, srcPort, dstIP,

dstPort,protocol_{}. We therefore use the terms “group of flows” and “group of}

(23)

1 2 3 4 2 3 1 3 1 2 9 1 0 7 8 1 6 9 7 7 1 8 5 6 1 2 1 4 6 1 6 1 3 6 8 5 1 1 1 2 1 5 1 3 1 4 9 5 0 1 7 1 8 4 2 5 1 9 2 0 2 3 2 4 2 1 2 2 9 6 8 2 5 2 6 5 6 6 4 0 6 6 6 2 7 2 8 2 9 3 0 3 1 3 2 8 7 6 3 3 3 4 5 5 7 4 1 1 4 1 4 3 1 5 4 1 8 0 2 9 2 3 3 9 3 4 3 4 3 0 4 8 2 4 8 3 5 5 6 5 8 6 6 1 9 6 3 2 6 4 5 6 7 3 6 7 9 7 2 6 7 2 3 7 8 8 8 2 2 8 3 2 8 5 5 8 7 9 9 4 5 9 4 9 9 5 4 1 0 3 1 1 0 4 4 1 0 4 8 3 5 3 6 3 7 9 5 5 3 8 3 9 7 7 0 4 1 4 0 4 2 4 3 4 4 4 5 4 6 4 7 4 8 2 7 1 4 0 2 4 2 7 5 1 3 7 3 4 8 3 8 4 9 5 0 5 1 5 2 1 5 1 1 9 1 3 3 1 3 4 5 5 3 5 4 4 3 1 5 9 5 7 5 8 6 0 6 1 6 4 6 2 6 3 5 8 7 6 0 3 7 1 5 6 5 6 6 6 7 6 8 6 9 7 0 7 1 7 2 7 3 1 7 3 4 5 8 5 3 3 7 7 7 8 7 5 7 6 7 9 8 0 8 1 8 2 1 1 1 8 7 8 8 8 5 8 6 6 7 6 8 3 8 4 8 9 9 0 9 1 9 2 1 0 1 9 5 9 6 9 3 9 4 9 8 9 9 5 1 4 1 0 0 1 0 2 1 0 3 1 0 4 1 0 5 1 0 6 1 0 7 1 0 8 1 0 9 1 1 0 1 1 2 1 1 3 1 1 5 1 9 7 1 2 0 1 1 8 1 1 9 1 1 6 1 1 7 1 2 2 1 2 5 1 2 6 3 7 5 1 2 3 1 2 4 1 2 7 1 2 8 1 2 9 1 3 0 1 3 1 1 3 6 1 3 4 1 3 5 1 3 2 1 3 3 1 3 7 1 3 8 1 4 1 1 4 2 1 3 9 1 4 0 2 7 0 4 6 0 7 5 8 9 9 5 1 4 8 1 4 9 1 4 6 1 4 7 1 4 4 1 4 5 1 5 5 1 5 6 1 5 2 1 5 3 1 5 0 1 5 7 1 5 8 1 5 9 1 6 0 1 6 1 1 6 2 1 6 3 1 6 4 1 6 5 1 6 6 1 6 7 1 6 8 1 6 9 1 7 2 1 7 0 1 7 1 7 8 3 1 7 4 1 7 5 1 7 6 1 7 7 1 7 8 1 7 9 1 8 1 1 8 2 2 5 7 6 4 1 6 9 9 1 8 3 1 8 4 1 8 5 1 8 6 1 8 7 1 8 8 1 8 9 1 9 0 1 9 2 1 9 3 1 9 4 1 9 5 1 9 6 1 9 8 1 9 9 2 0 0 3 5 6 2 0 1 2 0 2 2 0 3 2 0 4 2 0 5 2 0 6 2 0 7 2 0 8 2 0 9 2 1 0 2 1 1 2 1 2 2 1 3 2 1 4 2 1 7 2 1 8 2 1 5 2 1 6 2 1 9 2 2 0 3 0 1 2 2 1 2 2 2 2 2 3 2 4 4 2 2 4 2 2 5 3 0 8 2 2 7 2 2 6 2 2 9 2 2 8 7 2 7 2 3 0 2 3 4 3 2 6 2 3 2 2 3 3 3 1 0 2 3 7 2 3 8 2 3 5 2 3 6 6 1 0 2 3 9 2 4 0 2 4 3 2 4 1 2 4 2 2 4 5 2 4 6 2 4 7 2 4 9 2 4 8 2 5 1 2 5 0 2 5 2 2 5 5 2 5 3 2 5 4 2 5 6 2 5 8 2 5 9 2 6 0 2 6 1 2 6 2 2 6 3 9 9 4 2 6 4 2 6 5 2 6 6 2 6 7 7 1 6 8 9 2 1 0 0 4 2 6 8 2 6 9 2 7 4 2 7 5 2 7 2 2 7 3 2 8 0 2 8 1 5 3 9 2 7 8 2 7 9 2 7 6 2 7 7 2 8 4 2 8 2 2 8 3 2 8 7 2 8 8 2 8 5 2 8 6 5 0 9 2 8 9 2 9 3 2 9 0 2 9 1 2 9 4 2 9 5 
2 9 6 2 9 7 2 9 8 2 9 9 3 0 3 3 0 2 7 4 7 3 0 0 3 0 4 3 0 5 3 0 6 3 0 7 3 6 9 3 1 1 3 0 9 3 1 3 3 1 4 3 1 7 3 1 8 3 1 5 3 1 6 3 1 9 3 2 0 3 2 4 3 2 3 3 2 1 3 2 2 3 2 7 3 2 8 3 2 5 3 2 9 3 3 0 3 3 2 3 3 3 3 3 4 6 8 0 3 3 8 3 3 6 3 3 7 4 5 1 3 3 5 3 4 0 3 4 1 3 4 2 3 4 4 3 4 6 3 4 7 3 4 8 3 4 9 3 5 0 3 5 1 8 9 5 9 3 7 3 5 2 3 5 3 3 5 4 3 5 5 5 4 5 3 5 7 3 5 8 3 5 9 3 6 0 3 6 1 3 6 2 3 6 3 3 6 4 3 6 5 3 6 6 3 6 7 3 6 8 3 7 0 3 7 1 3 7 2 3 7 3 3 7 4 3 7 6 3 7 7 3 7 8 3 7 9 3 8 2 1 0 3 6 3 8 0 3 8 1 3 8 5 3 8 3 3 8 4 1 0 2 8 3 8 6 3 8 7 3 9 1 8 1 1 3 8 9 3 9 0 8 3 1 3 8 8 3 9 2 3 9 3 3 9 8 3 9 6 3 9 7 3 9 4 3 9 5 5 3 4 3 9 9 4 0 0 4 0 1 4 0 3 4 0 4 4 0 5 4 0 6 5 9 3 1 0 3 3 4 0 7 4 0 8 4 0 9 4 1 0 4 1 2 4 1 1 4 1 4 4 1 3 4 1 5 4 1 7 4 1 8 1 0 1 4 4 1 6 4 1 9 4 2 4 4 2 2 4 2 3 5 0 0 1 0 1 0 4 2 0 4 2 1 8 4 8 4 2 6 4 2 8 4 2 9 4 3 3 4 3 2 4 3 4 4 3 5 4 7 5 4 9 7 5 9 1 9 6 1 9 8 4 1 0 0 0 1 0 2 4 4 3 6 4 3 7 4 3 8 4 3 9 4 4 0 4 4 1 4 4 2 7 4 8 4 4 3 4 4 4 4 4 5 1 0 1 7 4 4 6 4 4 7 6 6 0 4 4 8 4 4 9 4 5 0 4 5 3 4 5 2 4 5 4 4 5 5 4 5 6 4 5 7 4 5 9 4 6 2 4 6 3 4 6 6 4 6 4 4 6 5 4 6 9 4 7 0 4 6 7 4 6 8 4 7 1 4 7 7 4 7 6 4 7 3 4 7 4 4 7 2 4 7 8 4 7 9 4 8 0 4 8 1 6 8 7 4 8 4 4 8 5 4 8 6 4 8 7 4 8 8 4 9 0 4 9 1 4 8 9 4 9 5 4 9 6 5 4 8 4 9 3 4 9 4 4 9 2 9 6 5 4 9 8 4 9 9 5 0 1 5 0 2 5 0 3 5 0 4 5 0 5 5 0 6 5 0 7 5 0 8 5 1 1 5 1 0 5 1 2 5 1 5 5 1 8 5 1 6 5 1 7 5 2 0 5 2 1 5 1 9 5 2 3 5 2 2 5 2 5 5 2 4 5 2 8 5 2 6 5 2 7 5 2 9 5 3 1 5 3 2 5 3 0 5 3 6 5 3 5 5 3 7 5 3 8 5 4 1 5 4 0 5 4 2 5 4 3 5 4 4 5 4 6 5 4 7 5 5 2 5 5 3 5 5 1 5 4 9 5 5 0 5 5 4 5 5 5 5 5 7 5 5 8 5 5 9 9 4 4 5 6 0 5 6 1 5 6 2 8 9 3 5 6 5 5 6 6 5 6 3 5 6 4 5 7 1 5 6 9 5 7 0 5 6 7 5 6 8 5 7 2 5 7 3 5 7 4 5 7 5 5 7 7 5 7 8 5 7 6 5 7 9 5 8 0 5 8 1 5 8 2 5 8 3 9 7 0 5 8 4 5 8 5 5 8 8 5 8 9 5 9 0 5 9 4 5 9 2 6 0 0 5 9 8 5 9 9 5 9 6 5 9 7 5 9 5 6 0 2 6 0 1 6 0 4 6 0 7 6 0 5 6 0 6 6 1 1 6 1 2 6 0 8 6 0 9 6 1 5 6 1 6 6 1 4 6 1 7 6 1 8 6 2 0 6 2 1 6 2 2 6 2 3 6 2 4 6 2 5 6 2 6 6 2 7 6 2 8 6 2 9 6 3 1 6 3 0 6 3 3 6 3 4 6 3 5 6 3 
Figure 2.1: A TDG representing a five-minute traffic snapshot of the FTP application. The figure shows that TDGs can have multiple connected components. For visualization purposes, the largest connected component is located in the center of the figure.

ets” interchangeably. All flows are considered bidirectional. We define a TCP flow to start at the first packet with the SYN flag set and the ACK flag not set, so that the initiator and the recipient of the flow are defined for the purposes of direction. For UDP flows, direction is decided by the first packet of the flow. For the rest of this dissertation, we refer to a TDG capturing a traffic snapshot over a fixed interval T as a (static) TDG snapshot.

Depending on the edge filter, we can direct the edges; for example, we might have the sender of the first packet in an interaction be the head of the directed edge. In addition, an edge can carry a weight or other useful information, such as the number of packets exchanged. A real-world example of a static snapshot is shown in Figure 2.1, representing FTP traffic over a five-minute interval. Next, we discuss edge filters in more depth.


2.1.1 Edge Filters

For completeness, we present a set of general edge filters that can be used in isolation or in combination to select the right set of flows depending on the focus of the study as discussed.

With a port-based filter, we collect traffic for a fixed destination (source) port. If the port number corresponds to a well-behaved single-port application, this amounts to monitoring the traffic of that application. Signature or content-based filters match string patterns in the payload of the packet. These filters can be very effective, but they assume that we have the signature of the desired traffic and access to the content of the packet. These assumptions do not hold for all applications or all traces. Using a flow-level filter, we create a graph with flows that meet certain flow-level criteria, such as specific packet sizes, packet inter-arrival times, or number of packets. In fact, the grouping of flows into TDGs can be automated (unsupervised) with machine learning and clustering algorithms [10, 14]. The advantage of these filters is that they can be made to work with no a priori information about how applications behave. We have used this approach successfully for application classification, as we show in §4. Similar methods have also been used successfully in other work [15, 7, 16, 6, 17].

How do we decide which edge filters to use? This depends on the focus and purpose of the measurement study. If the protocol we want to observe (e.g., DNS) operates on a default port (e.g., port 53), we can use port-based filters. If the target application is a P2P protocol that uses ephemeral port numbers, then we can choose a content-based or flow-level filter. Throughout the dissertation, we show examples and different applications of all the filters defined above.
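As an illustration of how an edge filter turns flow records into a TDG, the following minimal Python sketch applies a port-based filter for DNS (port 53). The `Flow` record, its field names, and the functions `port_filter` and `build_tdg` are hypothetical names introduced only for this example; the sketch also assumes that `src` is the initiator of each flow and that the edge weight is the number of packets exchanged.

```python
from collections import namedtuple

# Hypothetical flow record; the field names are illustrative only.
Flow = namedtuple("Flow", "src dst sport dport proto packets")

def port_filter(port):
    """Edge filter: keep flows whose destination port equals `port`."""
    return lambda flow: flow.dport == port

def build_tdg(flows, edge_filter):
    """Return a directed, weighted TDG as {(initiator, recipient): packets}."""
    edges = {}
    for flow in flows:
        if edge_filter(flow):
            key = (flow.src, flow.dst)  # src is assumed to be the initiator
            edges[key] = edges.get(key, 0) + flow.packets
    return edges

flows = [
    Flow("10.0.0.1", "10.0.0.53", 40000, 53, "UDP", 2),
    Flow("10.0.0.2", "10.0.0.53", 40001, 53, "UDP", 4),
    Flow("10.0.0.1", "10.0.0.80", 40002, 80, "TCP", 12),
    Flow("10.0.0.1", "10.0.0.53", 40003, 53, "UDP", 2),
]
dns_tdg = build_tdg(flows, port_filter(53))
```

Because the filter is an ordinary predicate, a content-based or flow-level filter could be swapped in without changing `build_tdg`.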


2.1.2 Quantifying TDGs

Graph metrics are used to describe and compare graphs. The metrics presented in this dissertation are divided into two groups: (a) static or unary metrics, which operate on a single graph, and (b) dynamic or binary metrics, which compare graphs. In this section, we focus on static metrics. The analysis of dynamic metrics is the topic of §5. Several metrics introduced here are not currently in common use in the measurement community, and our choices reflect experience gained through time-consuming trial and error.

Let us consider a TDG G(V, E), with V the set of nodes and E the set of edges. We use |X| to denote the cardinality of a set. For any edge (u, v) the nodes u, v are called the endnodes of the edge. In Figure 2.1 we show a graph visualization example of a TDG snapshot. The main goal of the metrics we describe next is to translate the visually meaningful aspects of a graph into quantitative measures that can automate the processes of comparing graphs (§5.3) and detecting changes (§5.4).

TDGs are typically directed, and hence we can group nodes based on the direction of their edges. Sinks ($V_{snk}$) are nodes that have only incoming edges and sources ($V_{src}$) are nodes that have only outgoing edges; we refer to the set of nodes having both incoming and outgoing connections as $V_{InO}$, or In-and-Out (InO). The subsets $V_{InO}$, $V_{snk}$, and $V_{src}$ partition $V$ (so $|V| = |V_{InO}| + |V_{snk}| + |V_{src}|$). We say that a pair of nodes u, v has a bidirectional edge if and only if $(u, v) \in E$ and $(v, u) \in E$. To quantify the symmetry of a graph, we use the percentage of communicating node pairs that have a bidirectional edge (BiDir). To quantify the connectivity of a graph, we use the size of its Largest Weakly Connected Component (LWCC). That is, if we treat the graph as undirected, the LWCC is the size of the largest connected component; we report these quantities as a percentage of the total number of nodes in the graph. The LWCC for the graph visualization example of Figure 2.1 is located in the middle of the figure and is highlighted with darker colored


edges (better viewed on a computer screen or a color print-out). From the figure we see that TDG snapshots can be disconnected. We have observed that measuring the sizes of the top 10 largest weakly connected components is a good way to characterize TDGs, as we show in §5.4 in more detail.

The neighborhood of a node u is the set of nodes adjacent to u and is denoted by $\Gamma(u)$. The degree of u is defined as $d(u) = |\Gamma(u)|$. The minimum endnode degree (MED) of an edge $e = (u, v)$ is defined as $\mathrm{MED}(e) = \min(d(u), d(v))$. We define the average degree of a TDG as $\bar{k} = \sum_{u \in V} d(u) / |V|$. We denote the number of nodes with degree k as $n(k)$ and the maximum degree of the graph as $k_{max}$. The degree distribution of a graph captures the probability that a randomly selected node has degree k, and is defined by $P(k) = n(k)/n$ for $k = 1, \ldots, k_{max}$, with $n = |V|$. The entropy of the degree distribution is defined as $H(X) = -\sum_{k=1}^{k_{max}} P(k) \log P(k)$, with each term in the sum taken to be 0 if $P(k) = 0$. To measure the uniformity of the distribution, we use the Relative Uncertainty (RU), given by $RU = H(X) / \log_2 k_{max}$, as defined in [9]. (The maximum entropy is achieved by the uniform distribution, so an RU value of 1 denotes the uniform distribution.) Note that all the above metrics can also be defined for the marginal distributions of incoming and outgoing edges of nodes, e.g., in-degree distribution, average in-degree, etc.
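The static metrics defined so far can be computed directly from an edge list. The sketch below is one possible reading of the definitions, not the implementation used for the results in this dissertation. It assumes that there are no self-loops, that the neighborhood of a node ignores edge direction, and that the entropy uses base-2 logarithms so that RU lies in [0, 1]. The function name `static_metrics` and the toy graph are illustrative.

```python
import math
from collections import Counter

def static_metrics(edges):
    """Compute a few unary TDG metrics from directed edges (u, v), u != v."""
    edges = set(edges)
    out_nodes = {u for u, _ in edges}
    in_nodes = {v for _, v in edges}
    nodes = out_nodes | in_nodes

    # Undirected neighborhoods Gamma(u); d(u) = |Gamma(u)| (an assumption).
    neighbors = {u: set() for u in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # BiDir: fraction of communicating node pairs with edges both ways.
    pairs = {frozenset(e) for e in edges}
    bidir = sum(1 for a, b in pairs if (a, b) in edges and (b, a) in edges)

    # LWCC: largest component of the undirected view (iterative DFS).
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in neighbors[node] - seen:
                seen.add(nb)
                stack.append(nb)
        largest = max(largest, size)

    n = len(nodes)
    degrees = [len(neighbors[u]) for u in nodes]
    k_max = max(degrees)
    p = [count / n for count in Counter(degrees).values()]  # P(k) for n(k) > 0
    entropy = -sum(pk * math.log2(pk) for pk in p)
    return {
        "InO": len(out_nodes & in_nodes),
        "sinks": len(in_nodes - out_nodes),
        "sources": len(out_nodes - in_nodes),
        "BiDir": bidir / len(pairs),
        "LWCC": largest / n,
        "avg_degree": sum(degrees) / n,
        "RU": entropy / math.log2(k_max) if k_max > 1 else 0.0,
    }

# Toy graph: a small star around "c" plus one isolated pair.
example = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("x", "y")]
metrics = static_metrics(example)
```

In the toy graph, two nodes are InO, one is a sink, and three are sources, so the three sets partition the six nodes as required.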

The distance between two nodes is defined as the length of the shortest path between them in the graph. The diameter of a graph is defined as the maximum distance over all pairs of nodes, which is a sensitive metric [18]. For a more robust metric, we use the effective diameter (EDiam), which we define as the 90th percentile of all pairwise distances in the graph.
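To make the difference between the diameter and the effective diameter concrete, the following sketch computes both with breadth-first search. It rests on assumptions not stated in the text: the graph is treated as undirected, only pairs connected by a path contribute a distance, and the percentile uses the nearest-rank rule. The function names are illustrative.

```python
import math
from collections import deque

def pairwise_distances(adj):
    """BFS from every node; each connected unordered pair appears twice."""
    dists = []
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen[nb] = seen[node] + 1
                    queue.append(nb)
        dists.extend(d for node, d in seen.items() if node != src)
    return dists

def effective_diameter(adj, percentile=90):
    """Nearest-rank percentile of all pairwise shortest-path distances."""
    dists = sorted(pairwise_distances(adj))
    if not dists:
        raise ValueError("graph has no connected node pairs")
    rank = math.ceil(percentile * len(dists) / 100)
    return dists[rank - 1]

# Undirected path graph 0 - 1 - 2 - 3 - 4 as an adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
diameter = max(pairwise_distances(path))
ediam = effective_diameter(path)
```

On this path graph the single longest pair (0, 4) sets the diameter, while the effective diameter is determined by the bulk of the distance distribution and is therefore smaller.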

The assortativity coefficient r is a summary metric of the correlation between the degrees of the endpoints of edges in the graph [19]. Its value is the Pearson correlation coefficient of the degrees of the endnodes of edges and lies in the range [−1, 1]. If r = 0, the graph appears to have random degree correlations, and there is no linear relationship between the degrees of neighboring nodes. If r > 0, then the graph is assortative, and high-degree vertices are likely to connect to other high-degree vertices. Conversely, if r < 0, then the graph is


disassortative, and high degree vertices are likely to connect to low degree vertices.
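A minimal sketch of the assortativity computation is given below. It assumes an undirected view of the graph without self-loops and counts each edge in both orientations, so that the Pearson correlation is symmetric in the two endnodes. The function name `assortativity` is illustrative, and the value is reported as undefined (NaN) when all endnode degrees are equal.

```python
import math

def assortativity(edges):
    """Pearson correlation of endnode degrees over undirected edges."""
    und = {frozenset((u, v)) for u, v in edges if u != v}
    degree = {}
    for edge in und:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1

    xs, ys = [], []
    for edge in und:
        u, v = tuple(edge)
        # Count each edge in both orientations to symmetrize the correlation.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when every endnode has the same degree
    return cov / math.sqrt(var_x * var_y)

# A star graph: the hub connects only to degree-1 leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = assortativity(star)
```

The star is the extreme disassortative case, since every edge joins the highest-degree node to a lowest-degree node.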

2.2 Datasets

In this dissertation we go beyond providing novel ideas on analyzing network-wide interactions to showing how well our methodologies work in the real world. To facilitate this research, we use traffic collected from six different backbone locations, allowing us to study TDGs and evaluate our methodologies over multiple vantage points on the Internet.

• PAIX: This location offers data collected from an OC48 link of a large commercial US Tier-1 ISP at the Palo Alto Internet eXchange (PAIX), connecting San Jose with Seattle. This location provides the largest trace in our dataset in terms of traffic volume, observed flows, and distinct IP addresses. The monitor captured traffic from both directions of the link. In addition, the dataset contains up to 16 bytes of payload from each packet.

• MFN: This trace contains traffic from a peering link of a large ISP on the US west coast. The monitor captured traffic from both directions of the link. In addition, the dataset contains up to 16 bytes of payload from each packet. The MFN and PAIX traces are kindly provided by CAIDA (www.caida.org).

• WIDE: The data are collected from a low-bandwidth (100 Mbps Ethernet) transpacific backbone link that connects the US with WIDE (Japan) and carries commodity traffic of the WIDE member organizations. The trace contains traffic from both directions of the link. For each packet, it contains the full packet header and 40 bytes of payload. This trace is kindly provided by MAWI (mawi.wide.ad.jp/mawi/).

• BRAZ: This location represents a smaller backbone link. The captured traffic is from a 1 Gbps Ethernet link that connects several small ISPs (residential users) to a larger provider. This trace is closer to the edge of the network and captures all the public traffic of the small ISPs. In addition, the small ISPs use NAT extensively, which makes this trace different from the others. This trace is kindly provided by Narus (www.narus.com).

• KSCY: An OC48 link from the Abilene (Internet2) academic network connecting Kansas City to Indianapolis. The traffic comes from academic and research institutes and contains a large amount of P2P traffic as well as experimental traffic from projects such as PlanetLab. Unfortunately, we do not have any payload information from this location. The monitor captures traffic from both directions of the link.

• CLEV: An OC48 link from the Abilene (Internet2) academic network connecting Cleveland to Indianapolis. The traffic comes from academic and research institutes and contains a large amount of P2P traffic as well as experimental traffic from projects such as PlanetLab. Unfortunately, we do not have any payload information from this location. The monitor captures traffic from both directions of the link. The KSCY and CLEV traces are kindly provided by NLANR and CAIDA (www.caida.org).

Basic statistics about the backbone traces that we use in this study are presented in Table 2.1. The statistics summarize the volume of traffic traversing the links over a five-minute period. The heaviest traffic load is observed at the PAIX link, and the link with the lowest utilization is WIDE. A high volume of DNS traffic is responsible for the high number of packets on the WIDE link despite its low utilization.

Are our traces representative? This is a question that can haunt any trace-driven study. We find that the use of four different traces at significantly different locations provides a reasonable sample space. Moreover, their traffic volume varies (see Table 2.1). Gaining access to these backbone data with real IPs and payload required several months and the signing of privacy agreements with three different organizations. We will continue to look for new backbone traces on which to run our algorithms, but this is clearly not a trivial task.


Table 2.1: Basic traffic statistics for each of the six backbone locations we include in this dissertation. The statistics correspond to a typical five-minute traffic slice.

| | PAIX | MFN | WIDE | BRAZ | KSCY | CLEV |
|---|---|---|---|---|---|---|
| Unique IP addresses | 258,636 | 480,637 | 114,824 | 402,309 | 198,752 | 232,579 |
| Flows | 888,750 | 3,375,899 | 263,325 | 787,783 | 1,782,979 | 1,857,709 |
| Packets | 1,529M | 1,414M | 1,432M | 1,002M | 886M | 1,008M |
| Utilization (Mbps) | 997.1 | 960.8 | 27.6 | 304.47 | 789.4 | 863.7 |
| Payload | Yes | Yes | Yes | Yes | No | No |

Ground truth

Extracting the ground truth (GT) of flows is required for two reasons: (a) It enables the accurate isolation of flows belonging to different applications, which allows us to study the network-wide behavior (TDGs) of each individual application. (b) It helps us evaluate how accurately our traffic classification study predicts the application class of different flows. In this dissertation, to extract the ground truth we use a Deep Packet Inspection (DPI) method that matches payload-based signatures for different applications. Such payload-based methods are the standard for extracting the ground truth for all current traffic profiling methods [20, 1, 21].

We will refer to the ground truth classifier as the Payload-based Classifier, or PC for short. The tool we use here was initially developed by Karagiannis et al. [22] and later enhanced in [10, 1] to include more application signatures. The tool works as follows. Because the traces only have payload for the first 16-40 bytes of each packet, the PC keeps a set of known signatures for each application class and tries to find an exact match for each flow by looking at the first 40 bytes. The signatures cover several applications, such as DNS, Web, SMTP, FTP, BitTorrent, Soribada, eMule, KaZaa, etc. For more details on how the PC works and the exact set of signatures, we refer the reader to [10, 22]. Payload classifiers report as “Unknown” flows that do not match any known signature (e.g., due to encryption). Also, payload classifiers cannot classify flows that do not carry any payload. Such zero-payload flows exist because of worm scanning activity and other failed TCP connections.


Since we are using payload classifiers as our ground-truth providers, we do not use these unknown and unclassified flows in our evaluation.
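The general idea of signature matching over truncated payloads can be sketched as follows. This is not the PC tool or its signature set, for which the reader should consult [10, 22]. The `SIGNATURES` table holds only a few well-known protocol prefixes chosen for illustration, prefix matching is an assumption about how a signature is applied, and `classify_flow` is a hypothetical name.

```python
# Illustrative prefixes only; the actual PC signatures are described in [10, 22].
SIGNATURES = {
    "BitTorrent": [b"\x13BitTorrent protocol"],  # BitTorrent handshake header
    "Web": [b"GET ", b"POST ", b"HTTP/1."],      # HTTP requests and responses
    "SMTP": [b"EHLO", b"HELO"],                  # SMTP client greeting
}

def classify_flow(packet_payloads, max_bytes=40):
    """Label a flow from the leading payload bytes of its packets.

    Returns an application name, "Unknown" if no signature matches,
    or None for zero-payload flows, which cannot be classified.
    """
    if not any(packet_payloads):
        return None
    for payload in packet_payloads:
        head = payload[:max_bytes]  # traces keep only the first bytes
        for app, prefixes in SIGNATURES.items():
            if any(head.startswith(prefix) for prefix in prefixes):
                return app
    return "Unknown"

bt_label = classify_flow([b"\x13BitTorrent protocol" + bytes(8)])
web_label = classify_flow([b"", b"GET /index.html HTTP/1.1\r\n"])
unknown_label = classify_flow([b"\x00\x01\x02\x03"])
empty_label = classify_flow([b"", b""])
```

Flows labeled "Unknown" or left unlabeled by such a classifier would be excluded from the evaluation, as stated above.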

2.3 Defining the Traffic Profiling/Classification Problem

Throughout this dissertation, we refer to this problem using the terms “traffic classification” and “traffic profiling” interchangeably. The goal of a traffic profiler/classifier is to assign application labels (e.g., Web, DNS) to all the flows in a packet trace. In this dissertation, we are interested in profiling traffic at the backbone, which is a more challenging problem than profiling at the network edge. Other methods, such as BLINC [1], are known to perform well at the edge but poorly at the backbone, as we discuss in §3 in more detail. The most important class is P2P traffic, which is the hardest to detect [10, 21, 1]. Therefore, a method with high P2P accuracy is highly desirable in practice [10].

In all chapters, we report all classification metrics in terms of flows. We compare the flow-labeling of all our methods to the ground-truth (i.e., the set of flows that are “known” to our PC as defined above). In traffic classification, the classes correspond to different network applications (e.g., BitTorrent, Web, FTP) and the classification instances are 5-tuple flows (as defined at the beginning of the chapter). As in any classification problem, we use the standard performance metrics as follows:

• Coverage: Percentage of flows being labeled (even those incorrectly labeled).

• Overall accuracy: Percentage of accurate labels over all flows. This metric gives the probability of a correct prediction for a randomly selected flow.

• Precision, Recall, and F1-score per application class: We compute the True Positives, False Positives, and False Negatives. The True Positives (TP) measure how many instances of a given class are correctly classified; the False Positives (FP) measure how many instances of other classes are confused with a given class; and the False Negatives (FN) measure the number of misclassified instances of a class. In our comparisons, we use the following standard metrics: Precision (P), defined as $P = TP/(TP + FP)$; Recall (R), defined as $R = TP/(TP + FN)$; and the F-Measure [14], defined as $F = 2P \cdot R/(P + R)$, which combines P and R.
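The metrics in the list above can be computed per flow as in the following sketch. The function name `evaluate`, the dictionary layout, and the toy labels are illustrative; a predicted label of `None` stands for a flow that the classifier left unlabeled, and overall accuracy is taken over all ground-truth flows, as the definition is worded.

```python
def evaluate(truth, predicted, app):
    """Flow-level metrics; `truth` and `predicted` map flow id -> label."""
    total = len(truth)
    coverage = sum(predicted.get(f) is not None for f in truth) / total
    accuracy = sum(predicted.get(f) == truth[f] for f in truth) / total

    tp = sum(truth[f] == app and predicted.get(f) == app for f in truth)
    fp = sum(truth[f] != app and predicted.get(f) == app for f in truth)
    fn = sum(truth[f] == app and predicted.get(f) != app for f in truth)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"coverage": coverage, "accuracy": accuracy,
            "precision": precision, "recall": recall, "F": f_measure}

truth = {1: "P2P", 2: "P2P", 3: "Web", 4: "Web", 5: "DNS"}
predicted = {1: "P2P", 2: "Web", 3: "Web", 4: "P2P", 5: None}
p2p_scores = evaluate(truth, predicted, "P2P")
```

In this toy example four of the five flows receive a label and two labels are correct; for the P2P class there is one true positive, one false positive, and one false negative.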


Chapter 3


Graphs have been used before to represent various interactions in network settings. For example, trust propagation networks and other social networks (e.g., [23]) are often expressed as graphs. However, none of the previous work explores the full potential of TDGs for traffic analysis to the extent we do in this dissertation. Moreover, we go one step further and apply our techniques to a range of applications, such as the profiling of traffic and the identification of polymorphic blending, which have not been addressed before using network-wide behaviors. Below, we provide an overview of the most closely related work and compare it with ours.

3.1 Traffic Profiling/Classification

Traffic classification has attracted significant attention, but as we explain below, none of the existing methods addresses all relevant problems.

We group classification methods according to their level of observation: (a) port-based, using well-known port numbers [24], (b) flow-based, using supervised [25, 26, 16, 27, 28] and unsupervised [15, 17, 7, 29] Machine Learning (ML) techniques, (c) host-based [1, 30, 9], and (d) payload-based [31, 5, 20, 32, 6].

Port-based: A representative classifier of this category is CoralReef [24]. These classifiers rely on well-known port numbers to classify flows. For example, if we have three flows, one on port 80, another on port 53, and another on port 25, they will be classified as Web, DNS, and SMTP, respectively. Today, it is very easy to change the port number used by an application (e.g., BitTorrent, Skype) to any arbitrary port number, including numbers used by well-known applications. Therefore, port-based solutions are no longer reliable [31].
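The three-flow example above amounts to a table lookup, as the short sketch below shows. It is not CoralReef code; the table `WELL_KNOWN_PORTS` and the function `classify_by_port` are illustrative names, and port 6881 is included only as an example of a port that is absent from the table.

```python
# Minimal lookup table covering only the three ports named in the text.
WELL_KNOWN_PORTS = {80: "Web", 53: "DNS", 25: "SMTP"}

def classify_by_port(dst_port):
    """Label a flow by its destination port alone."""
    return WELL_KNOWN_PORTS.get(dst_port, "Unknown")

labels = [classify_by_port(port) for port in (80, 53, 25, 6881)]

# A P2P client configured to use port 80 would be mislabeled as Web.
mislabeled = classify_by_port(80)
```

The last line illustrates the weakness discussed above: the lookup cannot distinguish real Web traffic from any other application that chooses to run on port 80.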

Flow-based: To overcome the limitations of port-based solutions, other approaches use machine learning (ML) algorithms to classify traffic using flow features (e.g., packet sizes). For an exhaustive list and comparison of ML algorithms, we refer the reader to [8] and [10]. The problem of training an ML algorithm on one trace and applying it to another was recently addressed in [33]. Even though supervised solutions to traffic classification are continuously improving, all supervised methods require per-application training and will thus not detect traffic from new applications. Our work has more in common with unsupervised methods, which group similar flows together. All previous methods [15, 17, 7] require manual labeling of clusters. Our work with Graption (§4) bridges this gap by providing a method to automatically label clusters of flows based on their network-wide behavior.

Host-based: In BLINC [1], the authors characterize the connection patterns of a single host at the transport layer (e.g., whether the host behaves like a P2P user) and use these patterns to label the flows of each host. BLINC uses graph models called graphlets to model a host’s connection patterns using port and IP cardinalities. Unlike TDGs, graphlets do not represent network-wide host interaction. In some sense, TDGs represent a further level of aggregation, by aggregating across hosts as well. Thus it is perhaps fair to say that while BLINC hints at the benefit of analyzing a node’s interaction at the “social” level, it ultimately follows a different path that focuses on the behavior of individual nodes. As we show, both of our approaches (§4, 6) perform better than BLINC on our backbone traces. Similar to BLINC, other host-based methods [34, 35] target the identification of P2P users inside a university campus (i.e., the network edge). The connection patterns of neighboring hosts (e.g., their degree distribution) were also used as features to profile network hosts [36]. Unlike Graption-P2P, the methods in [34, 35, 1, 36] do not use network-wide host interaction. In [37], the authors use a port-based method to identify P2P users, using their temporal appearance and connection patterns in a trace.

A different approach to host profiling was introduced by Trestian et al. [21]. They used readily available information from the Web to classify traffic using the Google search engine. They show very good results for classifying flows of legacy applications, but their results are not promising for P2P detection because of the dynamic nature of P2P IP hosts.


Our method can thus be used to complement the work in [21].

Payload-based: Payload-based techniques were introduced in [31, 5, 20] to detect P2P applications. Using available documentation and manual reverse engineering, these approaches extract signatures for various P2P applications. Our payload classifier (PC) described in §2.2 adopts this methodology. These early methods also support the observation that the first few bytes (packets) are sufficient to classify flows. More recent efforts [32, 6] use the first 64 bytes of each flow’s payload as a feature for traffic classification. Their findings also confirm that the first few payload bytes are sufficient. In [32], payload data are used to train classifiers. Ma et al. [6] used payload similarity to group similar flows, thereby simplifying the process of flow labeling by a network administrator. Both papers [32, 6] focus on the detection of conventional applications and not P2P applications.

3.2 Detecting Traffic Anomalies and Worms

The first use of TDGs we know of was for the detection of worm activity within enterprise networks [38, 13]. The main goal of that work was to detect the tree-like communication structure of worm epidemics within an enterprise. This characteristic of worms was also used for post-mortem trace analysis using backbone traces [39]. More recent studies use graph techniques to detect hit-list worms within an enterprise network, based on the observation that an attacker will alter the connected components in the network [12].

Tools for the analysis and visualization of network traffic, such as the Autofocus tool [40] and Plonka’s FlowScan [41], can infer volume-based anomalies by highlighting patterns of large resource consumption in network data. Our work can be seen as a complementary approach to these tools, by providing a means to visualize, understand, and analyze the static and dynamic network-wide behavior of applications.

Polymorphic Blending (PB) Attacks. These attacks have been introduced by Fogla


an attacker mimics the behavior of legitimate traffic, it can be very hard for an intrusion detection system (IDS) to identify the intruder. This is similar to the adversarial classification problem [43], where an adversary uses its knowledge about the IDS in order to constantly change its profile and evade detection. A general solution to the problem does not exist. The authors of [42] suggest addressing the problem by applying multiple IDSes that operate at different levels. Even though anomaly detection is a well-studied research area [44, 45], the problem of detecting polymorphic blending attacks by P2P applications is new and, to the best of our knowledge, has not been addressed before. Most prior efforts on anomaly detection focused on changes in resource consumption [44], without monitoring the behavior of particular applications. The problem of profiling applications in the backbone was the topic of [9], where the authors used entropy to summarize distributions and group related applications (e.g., client-server) together. However, the authors of [9] did not address the detection of changes in the profile of network applications. We have included entropy metrics in our feature set for profiling network traffic. Indeed, we build on prior efforts on application profiling [9] and anomaly detection [44] by utilizing unary and binary metrics for TDGs (see §5 for details).

3.3 Measurement Studies on Network-Wide Interactions

Passive monitoring of P2P overlays is studied by Sen et al. [46], targeting mainly the profiling of P2P hosts, including the measurement of bandwidth usage, how long they remain active, etc. The goal of the measurement is to support traffic engineering and not to profile the application. A similar study of large DNS traces [47] uses graphs in the context of classifying DNS servers according to their role in the DNS hierarchy and for generating a space-efficient DNS traffic summary. Moreover, neither work uses the dynamic nature of the graphs as we define them here.


networks for the extraction of Communities of Interest (CoI) [48, 49, 50, 51]. Any particular CoI will contain enterprise hosts with similar behavior (e.g., connections) and habits (e.g., a common mail server). In [49], graphs are used as a means of modeling connections and grouping similar hosts within corporate networks. Again, these papers differ substantially from ours in that we focus on using graphs to understand and model network-wide behavior of Internet applications.

Statistical methods are used in [9] for automating the profiling of network hosts and port numbers. The connectivity behavior and habits of users within enterprise networks are the focus of many papers, including [49]. In [46], the authors study P2P overlays using passive measurements, but target mainly the profiling of P2P hosts. The most recent work on network-wide interactions is by Jin et al. [52], who use graph-partitioning methods to extract and study the evolution of smaller communities (subgraphs) within a TDG. Such communities can represent communications between known servers (e.g., Google) and clients accessing the particular service. The authors of [52] also study the temporal properties of the subgraphs, showing that they are persistent over time. The work in [52] is different from our study of the dynamic characteristics of TDGs, where the changes of the entire graph are used to describe the dynamic behavior of an application (e.g., DNS). None of the above papers targets P2P detection.

A work by Latapy et al. [53] measures the evolution of TDG-like graphs between all the hosts exchanging a single packet of any type. The high aggregation of this graph is very different from our separate view of the traffic generated by different applications. Meiss et al. [54] used sampled Web flows to extract statistics on the behavior of clients and servers regarding their cardinalities and the level of traffic exchanged between them. While similar in spirit, we believe our work greatly expands and improves on these approaches.


Chapter 4

Profiling the Network-Wide Behavior of Applications


4.1 Overview

The main goal of this chapter is to identify the key structural differences in the TDGs of different network applications. We then focus on the more specific problem of identifying traffic from P2P applications. Intuitively, TDGs enable the detection of network-wide behavior (e.g., highly connected graphs) that is common among P2P applications and different from other traffic (e.g., Web). To this end, we propose a classification framework, dubbed Graption (Graph-based classification), as a systematic way to combine network-wide behavior and flow-level characteristics of network applications. Graption first groups flows using flow-level features, in an unsupervised and agnostic way, i.e., without using application-specific knowledge. It then uses TDGs to classify each group of flows. As a proof of concept, we instantiate our framework and develop a P2P detection method, which we call Graption-P2P. Compared to other methods (e.g., BLINC [1]), Graption-P2P is easier to configure and requires fewer parameters.

The key findings of the chapter can be summarized in the following points:

• Distinguishing between P2P and client-server TDGs. We use real-world backbone traces and derive graph theoretic metrics that can distinguish between the TDGs formed by client-server (e.g., Web) and P2P (e.g., eDonkey) applications. Section: §4.2.1.

• Practical considerations for TDGs. We show that even a single backbone link contains enough information to generate TDGs that can be used to classify traffic. In addition, TDGs of the same application seem fairly consistent across time. Section: §4.2.2.

• High P2P classification accuracy. Our framework instantiation (Graption-P2P) classifies 90% of P2P traffic with 95% accuracy when applied at the backbone. Such traces are particularly challenging for other methods. Section: §4.3.2.


• Comparison with a behavioral-host-based method. Graption-P2P performs better than BLINC [1] in P2P identification at the backbone. For example, Graption-P2P identifies 95% of BitTorrent traffic while BLINC identifies only 25%. Section: §4.3.3.

• Identifying the unknown. Using Graption, we identified a P2P overlay of the Slapper worm. The TDG of Slapper was never used to train our classifier. This is a promising result showing that our approach can be used to detect both known and unknown P2P applications. Section: §4.3.4.

4.2 Profiling TDGs of P2P Applications

Our first step is to visually compare the TDGs of different applications. In Figure 4.1, we show TDG examples from two different protocols. In order to motivate the discussion in the rest of the chapter, we show the contrast between a P2P and a client-server TDG. From the figure we see that P2P traffic forms more connected and denser graphs than client-server traffic. In §4.2.1, we show how we can translate the visual intuition of Figure 4.1 into specific graph metrics that can be used to classify TDGs that correspond to different applications.

Datasets. The traces we use in this chapter are summarized in Table 4.1. For more details on the backbone links used here (PAIX and CLEV), please refer to §2.2. Some of the traces described in §2.2 were available to us for a limited time and for a specific project and are therefore not considered in this chapter. We use the Payload-based Classifier (PC) to establish the ground truth of flows for the TR-PAY1 and TR-PAY2 traces using the methodology described in §2.2. Running the PC over the TR-PAY1 and TR-PAY2 traces, we find 14% of the traffic to be P2P, 27% Web, 7% DNS, and the rest to belong to other applications. A detailed application breakdown is given in Table 4.2. In the table we further report the traffic information for the six P2P applications with the highest number of flows in our data set. These six applications contribute ∼95% of the flows and ∼75% of the
