IDENTIFYING APPLICATION PROTOCOLS IN COMPUTER NETWORKS USING VERTEX PROFILES

By

Edward G. Allan, Jr.

A Thesis Submitted to the Graduate Faculty of WAKE FOREST UNIVERSITY

in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

December 2008

Winston-Salem, North Carolina

Approved By:

Errin W. Fulp, Ph.D., Advisor

Examining Committee:

David J. John, Ph.D., Chairperson

William H. Turkett, Jr., Ph.D.

Acknowledgements

This thesis is the product of many people’s labors, not just my own. The ideas contained in the pages that follow have been formulated and refined for over a year, with the guidance and support of several people, whose assistance I would be remiss not to mention. I would like to thank Wake Forest University and GreatWall Systems, Inc. for their support. This research was funded by GreatWall Systems, Inc. via the United States Department of Energy STTR grant DE-FG02-06ER86274.¹

I would also like to thank my parents for their support throughout my years at Wake Forest, both as an undergraduate and as a graduate student. Without their encouragement and financial assistance, none of this would have been possible. I also would not be where I am today without the help of my friends, who have made these past several years some of the most enjoyable and most memorable yet.

My thesis committee members, Dr. David John and Dr. William Turkett, Jr., were instrumental in providing me with feedback throughout the research and writing process. Their comments and criticism have undoubtedly enabled the success of this endeavor. I would especially like to thank Dr. Turkett for selflessly spending hours assisting me and stepping in as my “adopted advisor” during Dr. Errin Fulp’s sabbatical.

Last, but certainly not least, I must thank my advisor, Dr. Errin Fulp. I have been fortunate to work with him in a variety of contexts for more than five years now, and he has been a tremendous influence on both my personal and academic development. His relaxed personality and great sense of humor kept me off-task just enough to save my sanity, while his insight and guidance allowed me to complete my studies and be ready to move on to the next chapter in my life. Many thanks again to all who have helped me along the way — you are much appreciated.

¹ The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DOE or the U.S. Government.

Acknowledgements

Illustrations

Abbreviations

Abstract

Chapter 1 Introduction

1.1 Issues in Network Management and Security

1.2 Current Methods of Network Analysis

1.2.1 Applications and Port Numbers

1.2.2 Packet Inspection

1.3 Interdisciplinary Study of Network Communications

1.3.1 Social Networks

1.3.2 Biological Networks and Motifs

1.4 Outline

Chapter 2 Computer Networks and Communications

2.1 Network Topologies and Architectures

2.2 Computer Network Reference Models

2.2.1 The OSI Model

2.2.2 The TCP/IP Model

2.3 Layer 3: The Network Layer

2.4 Layer 4: The Transport Layer

2.5 Layer 7: The Application Layer

Chapter 3 Graph Analysis

3.1 Graph Terminology and Basic Properties

3.2 Types of Graphs

3.3 Traditional Graph Measures

3.3.1 Distances and Path Lengths

3.3.2 Centrality Measures

3.3.3 Clustering Coefficient

3.3.4 Application of Traditional Graph Measures in Computer Networks

3.4 Network Motifs

3.4.1 Definition of a Motif

3.4.2 Function of Motifs

3.5 Analysis of Application Graphs

Chapter 4 Data Selection and Considerations

4.1 Network Trace Files

4.2 Challenges Associated with Network Data Collection

4.2.1 Data Capture

4.2.2 Privacy and Sanitization of Data

4.2.3 Network and Data View

4.3 Data Sources

4.3.1 Dartmouth College Wireless Traces

4.3.2 LBNL/ICSI Enterprise Tracing Program

4.3.3 OSDI Conference Network Traces

4.4 Protocol Selection

Chapter 5 Experimental Methodology

5.1 Hardware and Linux System

5.2 Packet Capture and Storage

5.3 Creation of Application Graphs

5.4 Traditional Graph Measures

5.5 Motif Analysis

5.6 Vertex Profiles

5.7 K-Nearest Neighbor Classification

5.7.1 Measuring Profile Separation

5.7.2 Cross Validation of Classification Results

5.8 Genetic Algorithm Feature Weighting

5.8.1 Overview of Genetic Algorithms

5.8.2 Feature Weighting

Chapter 6 Results and Analysis

6.1 Preliminary Investigations

6.2 Initial Results

6.2.1 Traditional Graph Measure Profiles

6.2.2 Motif-based Profiles

6.3.1 Attribute Weights of Traditional Graph Measures

6.3.2 Attribute Weights of Motif-based Measures

6.4 Comparison of Profile Types

6.5 Considerations for Optimizing Classifier Performance

6.6 Limitations of Current Approach

Chapter 7 Conclusions and Future Work

References

Appendix A Examples of Application Graphs

Appendix B Code Listings

Appendix C Test Parameters

Appendix D Additional Classification Results

Illustrations

List of Tables

4.1 Summary statistics of three trace files examined

5.1 Graph orders for each application protocol

6.1 Classification accuracy of 65 application graphs

6.2 An example confusion matrix with three classes

6.3 Confusion matrix of unweighted traditional graph measures

6.4 Number of single and multi-class ties for traditional graph measures

6.5 Confusion matrix of unweighted motif-based profiles

6.6 Number of single and multi-class ties for motif-based profiles

6.7 Percentage of original data used in motif-based profiles

6.8 Attribute weights for traditional graph measures

C.1 FANMOD test parameters

D.1 Confusion matrix of 65 application graphs using motif frequencies

D.2 Confusion matrix of weighted traditional graph measures

D.3 Confusion matrix of weighted motif profiles

List of Figures

1.1 Example output from NetStat

1.2 Graphical depiction of a social network with two distinctly visible clusters

2.1 Four network topologies: bus, ring, star and mesh [1]

2.2 The OSI and TCP/IP reference models [2]

2.3 An IP datagram header [2]

2.4 UDP and TCP datagram headers [2]

2.5 Example communication between a client and a web server

3.1 A graph with five nodes and five edges

3.2 Schematic view of motif detection [3]

3.3 All 13 configurations of order 3 connected subgraphs [3]

3.4 A feed-forward loop

4.1 Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags

5.1 Overview of the proposed methodology and tools used

5.2 Storing packets from a pcap file into a MySQL database

5.3 A motif with colored vertices

5.4 FANMOD edge-switching process for generating random networks [4]

5.5 Arrays representing vertex profiles

5.6 Single-point crossover of two binary strings

6.1 Profile collisions for traditional graph measures

6.2 Profile collisions for motif-based profiles

6.3 Depiction of three application graphs: HTTP, AIM and SSH

6.4 Accuracy of unweighted vs. weighted traditional graph measure profiles

6.5 The ten highest-weighted motifs and their corresponding weights

6.6 Accuracy of unweighted vs. weighted motif-based profiles

6.7 Accuracy comparison of unweighted profile types

6.8 Accuracy of single attribute classification

6.9 Comparison of profile types as the size of the training set increases

A.1 Application graphs depicting AIM communications

A.2 Application graphs depicting DNS communications

A.3 Application graphs depicting HTTP communications

A.4 Application graphs depicting Kazaa communications

A.5 Application graphs depicting MSDS communications

A.6 Application graphs depicting Netbios communications

Abbreviations

Acronyms

AIM - AOL Instant Messenger™

API - Application Programming Interface

AUP - Acceptable Use Policy

DNS - Domain Name System

FFL - Feed-forward loop

HTTP - HyperText Transfer Protocol

IANA - Internet Assigned Numbers Authority

IDS - Intrusion Detection System

IP - Internet Protocol

MSDS - Microsoft Directory Share

OSI - Open Systems Interconnection

P2P - Peer-to-peer

SANS™ - SysAdmin, Audit, Networking, and Security

SMTP - Simple Mail Transfer Protocol

SSH - Secure Shell

TCP - Transmission Control Protocol

UDP - User Datagram Protocol

VoIP - Voice over IP

Symbols

|V| is the number of vertices in a graph

e_ij is an edge from vertex i to vertex j

deg(v) is the degree of vertex v

id(v) is the indegree of vertex v

od(v) is the outdegree of vertex v

N(v) is the set of nodes in the neighborhood of vertex v

e(v) is the eccentricity of vertex v

rad(G) is the radius of graph G

diam(G) is the diameter of graph G

d(u, v) is the distance between vertex u and vertex v

C_D(v) is the degree centrality of vertex v

C_B(v) is the betweenness centrality of vertex v

C_C(v) is the closeness centrality of vertex v

x_i is the eigenvector centrality of vertex i

C(v) is the clustering coefficient of vertex v

Abstract

Edward G. Allan, Jr.

Identifying Application Protocols in Computer Networks Using Vertex Profiles

Thesis under the direction of Errin W. Fulp, Ph.D., Associate Professor of Computer Science

Security and management of computer network resources are two critical activities that challenge system administrators. They face potential threats from outside intruders as well as internal users who already have access to the organization’s assets. It is imperative that administrators are aware of what applications are being executed, but the use of data encryption techniques and non-standard port numbers presents difficulties that must be overcome.

To that end, this thesis introduces a novel method to identify application protocols based on the analysis of application graphs, which model application-level communications between computers. The performance of two types of node descriptions, called vertex profiles, is compared. “Traditional” vertex profiles characterize each node using several well-studied graph measures. Furthermore, this work uniquely applies motif-based analysis, which has previously been used primarily in systems biology, to the study of application graphs by creating a second type of vertex profile based on a node’s participation in statistically significant motifs. Machine learning techniques are employed to evaluate the importance of specific profile features. The experimental results, using a nearest-neighbor classifier, show that this type of analysis can correctly classify the applications observed with greater than 80% accuracy.

Chapter 1: Introduction

Managing and securing today’s critical data networks is a daunting and expensive task. According to INPUT [5], demand for vendor-furnished information systems and services by the U.S. government will increase from $71.9 billion in 2008 to $87.8 billion in 2013. This money funds such tasks as system modernization, information sharing, IT management and information security. As computer networks increase in size, speed and complexity, and malicious hackers develop more sophisticated attacks, traditional methods of managing and securing these networks begin to break down.

This thesis proposes a novel approach to identifying the actions of hosts within a network by examining the properties of application graphs, which model the social and functional interactions of hosts with one another at the software application level (e.g., HTTP and FTP). With the aid of machine learning techniques and algorithms, this method exploits graph characteristics of each host in the application graph, such as its connectedness, its position in the graph and the shapes of the subgraphs in which it is found. One distinct advantage to this approach is that classification can be performed “in the dark”, meaning that the packet payloads are either unavailable or have been encrypted, rendering deep packet inspection futile. Knowing what activities users on the network are participating in is crucial to network administrators who must manage bandwidth allocations, network configurations, performance, and security and access policies. The following sections of this chapter provide background information and motivation for the study.

1.1 Issues in Network Management and Security

To protect itself from litigation and to help ensure the integrity of its network, an organization (such as a school, business, or government) will often develop an Acceptable Use Policy, or AUP. An AUP defines what behaviors are acceptable for internet browsing, what applications can be run by users and other relevant guidelines for usage. The SANS Security Policy Project [6] provides several resources and templates for such policies. Take, for example, a policy that does not allow users to run a personal web server using an organization’s computing resources. Identifying such behavior can help to preserve network bandwidth for legitimate business activities.

Failure to comply with an organization’s AUP not only wastes computing resources but can also have serious security implications. Continuing with the example above, running an improperly configured web server or hosting insecure web application files gives an attacker an easy point of entry into the network. A study performed by MITRE covering 2001 to 2006 notes a sharp increase in the number of public reports for vulnerabilities that are specific to web applications [7]. For several years buffer overflow attacks had been the most common, but they were overtaken in 2005 by web application vulnerabilities such as SQL injection, cross-site scripting (XSS) and remote file inclusion. It is, therefore, in a network administrator’s best interest to ensure that the network is properly utilized in accordance with the policies and guidelines adopted by the organization.

1.2 Current Methods of Network Analysis

Several tools allow system administrators to determine which applications are being used on a network. This information assists them in the maintenance and protection of networked systems. Sophisticated users, however, are able to hide their activities, which could potentially include actions that are against the organization’s AUP, or worse yet, are illegal. This section examines a few of the tools used by administrators and identifies some of their weaknesses.

1.2.1 Applications and Port Numbers

When data is sent to a computer over a network, the destination port number identifies which application on the host computer should receive and process the data. Many applications use port numbers specified by the Internet Assigned Numbers Authority [8]. For example, FTP servers use ports 20 and 21, while web servers use port 80 by default. NetStat is a command line tool that shows information about network connections, both incoming and outgoing [9]. Figure 1.1 demonstrates the output of the NetStat command.

$ netstat -ta
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:2208          *:*                     LISTEN
tcp        0      0 *:sunrpc                *:*                     LISTEN
tcp        0      0 *:auth                  *:*                     LISTEN
tcp        0      0 *:35763                 *:*                     LISTEN
tcp        0      0 localhost:ipp           *:*                     LISTEN
tcp        0      0 localhost:smtp          *:*                     LISTEN
tcp        0      0 localhost:36699         *:*                     LISTEN
tcp6       0      0 *:ssh                   *:*                     LISTEN

Figure 1.1: Example output from NetStat

A network administrator could see that a host on the network is listening on port 80, indicating the presence of a web server. The administrator could then shut down that service and take appropriate disciplinary action toward the user. The problem with this method of detecting network applications is that while many applications do run on a known port number, they do not necessarily have to. If a web server were reconfigured to listen for connections on port 6000, clients could still connect to it. Users wishing to hide their activities might attempt to disguise an application by using such a non-standard port number. Chapter 2 describes port numbers and other networking concepts in more detail.
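The limitation described above can be made concrete with a minimal sketch, written in Python purely for illustration. The port table is a small hand-picked subset of well-known assignments, and `guess_application` is a hypothetical helper, not part of NetStat or any other tool discussed here.

```python
# Minimal sketch of port-based application identification.
# WELL_KNOWN_PORTS is a small, hand-picked subset of well-known assignments;
# guess_application is a hypothetical helper introduced for illustration.
WELL_KNOWN_PORTS = {
    20: "ftp-data",
    21: "ftp",
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
}


def guess_application(dst_port: int) -> str:
    """Return the application conventionally bound to dst_port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


# A web server on its default port is labeled as expected.
print(guess_application(80))
# The same server moved to port 6000 is no longer recognized as HTTP.
print(guess_application(6000))
```

Because the lookup sees only the port number, any service moved off its default port is either missed or mistaken for whatever application is conventionally registered on the new port.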

1.2.2 Packet Inspection

Another method of detecting network applications is to scrutinize the data contained in each packet as it traverses the network. Packets contain information such as HTTP requests, email headers and MP3 filename searches, as well as protocol-specific session initiations and version numbers that can be used to identify a particular application. Wireshark is a popular network protocol analyzer that has several useful features for viewing packet contents, reassembling sessions and gathering statistics about network data [10]. Packet inspection is commonly used in intrusion detection systems (IDS) such as Snort [11]. A rule-based engine searches packet data, compares it against a list of known attacks and generates a predefined response (such as notifying an administrator). The problem with packet inspection is that traffic is increasingly encrypted. Data payloads that have been transformed into ciphertext are not human-readable until they are decrypted with the appropriate key, nor do the payloads match the known attack strings in the case of IDS.
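A minimal sketch of signature-based payload inspection follows. The two patterns are simplified assumptions chosen for illustration and are far cruder than the rule sets used by a real IDS; `identify_payload` is a hypothetical helper. The XOR step only stands in for encryption so that the example stays self-contained; it is not a real cipher.

```python
import re

# Simplified, assumed signatures for two plaintext protocol openings.
SIGNATURES = {
    "http": re.compile(rb"(GET|POST|HEAD) \S+ HTTP/1\.[01]\r\n"),
    "ssh": re.compile(rb"SSH-\d\.\d+-"),
}


def identify_payload(payload: bytes) -> str:
    """Return the first protocol whose signature matches the payload start."""
    for name, pattern in SIGNATURES.items():
        if pattern.match(payload):
            return name
    return "unknown"


clear = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
# Stand-in for ciphertext: a fixed-byte XOR, used only to scramble the bytes.
scrambled = bytes(b ^ 0x5A for b in clear)

print(identify_payload(clear))
print(identify_payload(b"SSH-2.0-OpenSSH_5.1\r\n"))
print(identify_payload(scrambled))
```

Once the payload bytes no longer resemble the plaintext protocol, the signatures have nothing to match, which is the situation that motivates classification "in the dark".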

1.3 Interdisciplinary Study of Network Communications

It is therefore the goal of this study to look beyond current methods for identifying network behavior and propose a novel approach that relies upon high-level communication patterns observed among hosts. To accomplish this goal, this study borrows ideas and algorithms from several disciplines. Networks are not unique to computer science; they exist in mathematics, sociology, biology, communications and other areas of study as well. A graph, a collection of objects (sometimes called nodes) linked by edges, is the abstract model that allows for the analysis of any type of network. Graphs can represent relationships among friends, the interaction of biological entities in a transcriptional regulation network, the collaboration between authors of research papers [12], as well as a myriad of other problem spaces. Chapter 3 illustrates the properties of graphs in more depth.

1.3.1 Social Networks

One key area of study that this thesis borrows from is social network analysis, which focuses on relationships among social entities (also known as actors), and on the patterns and implications of these relationships [13]. The properties of social graphs reveal interesting information such as the spread of disease or material goods through the network, as well as which actors are “influential” (politically, socially, etc.). Social network analysis also has military and intelligence applications. Yang and Ng provide visualizations and analysis of weblog social networks related to terrorism and other crime-related matters [14].

To provide a simple working example of social network analysis, Figure 1.2 depicts the author’s social network of friendships taken from the popular social networking web site Facebook™. There are two clearly visible “clusters” of friends in the graph, created by nodes in each cluster sharing many common links with other nodes in the cluster. In the context of this social network, it means that many of the author’s friends in each group are also friends with each other. The group on the left is composed primarily of relationships formed during the author’s tenure at Wake Forest University, while the cluster on the right is composed primarily of relationships formed prior to and during high school.

Figure 1.2: Graphical depiction of a social network with two distinctly visible clusters

Several concepts pertaining to social networks can be extended to the study of application graphs performed in this work. Application graphs model the social relationships between clients and servers in a computer network by showing with which web servers users choose to interact, with whom they communicate via instant messaging clients and with whom they choose to share files. For example, the application graph for AOL Instant Messenger™ might show several chat clients communicating with a central chat server, which then passes messages along to the intended recipients. Characteristics of these high-level interactions are used to identify the software application through which the communication occurs. Section 3.3 elaborates upon the graph measures frequently used to quantify aspects of social networks.
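The instant messaging example can be sketched as a small directed graph. The host names and flows below are invented for illustration, and the helper names `build_application_graph` and `degrees` are hypothetical; the sketch only shows how a central server stands out through its indegree and outdegree.

```python
from collections import defaultdict

# Hypothetical flows: each pair is (sender, receiver) of at least one message.
flows = [
    ("client_a", "chat_server"), ("chat_server", "client_b"),
    ("client_b", "chat_server"), ("chat_server", "client_a"),
    ("client_c", "chat_server"), ("chat_server", "client_c"),
]


def build_application_graph(pairs):
    """Map each host to the set of hosts it sends data to."""
    graph = defaultdict(set)
    for src, dst in pairs:
        graph[src].add(dst)
        graph[dst]  # touch the key so pure receivers also become vertices
    return dict(graph)


def degrees(graph):
    """Return {vertex: (indegree, outdegree)} for a directed graph."""
    indegree = {v: 0 for v in graph}
    for src, neighbors in graph.items():
        for dst in neighbors:
            indegree[dst] += 1
    return {v: (indegree[v], len(graph[v])) for v in graph}


graph = build_application_graph(flows)
for host, (indeg, outdeg) in sorted(degrees(graph).items()):
    print(host, indeg, outdeg)
```

In this toy graph the chat server has an indegree and outdegree of 3, while every client has an indegree and outdegree of 1, which is the kind of structural difference the graph measures of Section 3.3 quantify.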

1.3.2 Biological Networks and Motifs

The study of biological networks is another key field from which ideas for this thesis are borrowed. Cellular processes are regulated by the interactions of several molecules such as proteins and DNA [15]. These complex interactions can be modeled as graphs. One particular method used to analyze these graphs is to search within them for motifs: recurring, significant patterns of interconnections. Milo et al. find motifs in networks from several fields, including biochemistry, neurobiology, ecology and engineering. They suggest that motifs are the basic structural elements capable of defining broad classes of networks [3].

Motif analysis is often used in biology [3, 16, 17, 18], but has not yet been applied to application graphs. One goal of this study is to determine whether a motif or group of motifs can help identify what application a computer is using. It finds that several protocols use similar motifs, partly because many applications have a client-server architecture (described in Section 2.1). However, there is still enough distinction in how the applications are used at a social level to identify them using the models developed in this work. Chapter 6 discusses some of the motifs found in application graphs.
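As a rough illustration of what searching a graph for a pattern involves, the sketch below counts occurrences of one well-known three-node pattern, the feed-forward loop, in which X links to Y, Y links to Z and X also links to Z. The function name and the edge list are hypothetical. The sketch counts raw occurrences only; it does not perform the comparison against randomized networks that establishes whether a pattern is statistically significant, and it is not the detection tool used in this thesis.

```python
from itertools import permutations


def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    edge_set = set(edges)
    vertices = {v for edge in edge_set for v in edge}
    count = 0
    for x, y, z in permutations(vertices, 3):
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set:
            count += 1
    return count


# Hypothetical directed edges: a, b, c form one feed-forward loop.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_feed_forward_loops(edges))
```

Enumerating all ordered triples is cubic in the number of vertices, so this brute-force approach is only practical for small graphs.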

1.4 Outline

The remainder of this thesis is organized as follows. Chapter 2 covers background on computer networks and their reference models, and details the network layers used to create application graphs. Chapter 3 introduces several concepts relating to graph theory and “traditional” graph measurement techniques, and provides more information about motifs. Data sources and application protocol selection are covered in Chapter 4. Chapter 5 specifies the tools used in this thesis and introduces the machine learning techniques used for the modeling and classification of application types. Chapter 6 discusses the results obtained, analyzes key motifs and graph metrics, and compares traditional graph measures with a motif-based approach. Finally, Chapter 7 concludes this study and explores possible topics for future research.

Chapter 2: Computer Networks and Communications

Undoubtedly, the interconnection of computers and networks to the World Wide Web has increased mankind’s ability to share information, perform research and become more efficient at everyday tasks. However, not all users have benign intentions. Illegal hacking, cyber terrorism and fraud wreak havoc on governments, corporations and individuals alike. Data encryption is often used to shield both malicious and legitimate activity from observation. By exploring the communication patterns found within networks, this study shows that it is still possible to gain some insight into what applications are being utilized. The following sections introduce several basic concepts related to network architectures, protocols and applications.

2.1 Network Topologies and Architectures

Network topologies describe the arrangement and mapping of networked elements, such as computers, printers, wires and routers. Mappings can be physical or logical. Physical topology describes where the elements are actually located and how they are interconnected with wires. Logical topology, on the other hand, refers to the path data appears to take when traveling from one network host to another [1]. A network’s logical topology might be very different from the underlying physical topology, but it is bound by the network protocols that direct how the data moves across the network. Application graphs are a generalization of logical topologies in that they provide a picture of how data moves between hosts, but from a very high-level view.

There are several shapes used to describe network topologies including bus, tree, star, mesh and ring. In the case of a physical network, these shapes have an impact on network performance, reliability and ease of management. For example, a bus network is cost-effective and easy to implement, but the architecture can only support a limited number of hosts and a bad cable will bring down the entire network. A star network allows for the isolation of the periphery nodes, but the central hub might be a single point of failure for the network. Logical topologies show the exchange of information between entities that are not physically connected by the network infrastructure. For example, IBM’s Token Ring network technology is a logical ring but is physically wired in a star topology.

Figure 2.1: Four network topologies: bus, ring, star and mesh [1]

In terms of software application models, two prevalent architectures are found in computer networks: the client-server model and peer-to-peer (P2P) architectures. In the client-server model, a client machine is responsible for initiating a request to some application running on another computer. The server waits for an incoming request from a client and then sends a response. Client-server architecture allows for computing responsibilities to be divided up among servers in the network, where one computer might act as a web server, another as an email server and so on. While the data sent between the client and server might go through several network devices, the logical data flow is a single link between the two nodes. A star network could then be induced by several clients connecting to a common server (see Figure 2.1). In a P2P network, nodes both initiate and respond to requests from other computers on the network known as peers. Consequently, the logical topology of such interactions could form a mesh network. This study examines the characteristics of logical topologies extended to the application layer, modeled as application graphs.
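The contrast between the two architectures can be illustrated with an idealized sketch. It assumes every client talks only to one server in the client-server case, and every peer talks to every other peer in the P2P case; real P2P overlays are usually only partially connected. All names are hypothetical.

```python
from itertools import combinations


def star_edges(server, clients):
    """Logical links when every client talks only to one server."""
    return {frozenset((server, client)) for client in clients}


def mesh_edges(peers):
    """Logical links when every peer talks to every other peer."""
    return {frozenset(pair) for pair in combinations(peers, 2)}


def degree_sequence(edges):
    """Return vertex degrees of an undirected graph, largest first."""
    degree = {}
    for edge in edges:
        for vertex in edge:
            degree[vertex] = degree.get(vertex, 0) + 1
    return sorted(degree.values(), reverse=True)


hosts = ["h1", "h2", "h3", "h4"]
print(degree_sequence(star_edges(hosts[0], hosts[1:])))
print(degree_sequence(mesh_edges(hosts)))
```

With four hosts, the star yields one vertex of degree 3 and three vertices of degree 1, while the full mesh gives every vertex degree 3, so even a simple degree count separates the two idealized cases.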

2.2 Computer Network Reference Models

Application graphs are created using information from several layers of the network communication process. Data goes through a series of transformations before being sent to its destination, including breaking the data into manageable fragment sizes, adding quality of service information, specifying how the data should be transmitted and converting it into the electrical pulses that traverse the wire. Three layers in particular are of interest: the network, transport and application layers, described in Sections 2.3–2.5.

There are two fundamental models referenced when describing network layers: the OSI model and the TCP/IP model. The protocols (rules that govern the syntax and meaning of data sent between entities) associated with the OSI model are rarely used, but the features described at each layer are still important. In contrast, the TCP/IP model is not as rigidly defined as the OSI model, but the protocols associated with it are widely used [2]. This section provides an overview of these models, depicted in Figure 2.2.

2.2.1 The OSI Model

The Open Systems Interconnection Basic Reference Model (OSI Model) was designed to promote international standardization of the protocols used in communication networks. There are seven layers in this model: the physical layer, data link layer, network layer, transport layer, session layer, presentation layer and application layer [19]. The physical layer deals with representing and transmitting raw bits over a communication channel. Well known examples include Ethernet over twisted pair (10BASE-T, 100BASE-TX) and 802.11a/b/g wireless standards. The task of the data link layer is to correct transmission errors from the physical layer and provide the means to enable point-to-point communication between hosts within a local area network. This layer arranges data into frames and also provides medium access control to share communication channels between multiple users.

The network layer determines how packets are routed from the source to the destination, allows the interconnection of heterogeneous networks and provides congestion control. The next layer in the model, the transport layer, provides logical communication between processes on the hosts and is the first true end-to-end layer in the model. The session and presentation layers are not generally used; their intent is to provide session management between hosts, synchronization, interruption recovery and “on the wire” management of abstract data structures. The final layer in the OSI model is the application layer. This is the layer at which a user directly interacts with the program (a web browser, for example) that sends network data.

2.2.2 The TCP/IP Model

First proposed in 1974, the TCP/IP model [20] presents a slightly different view of network communications with four layers that are not as strictly defined as those in the OSI model. Whereas the OSI model was developed before the associated protocols, the TCP/IP model was developed based on protocols that already existed, taking its name from its two key protocols. The host-to-network layer is somewhat ill-defined and does not specify the protocols necessary for a host to send packets to the internet layer. It combines elements of the OSI model’s physical and data link layers. The internet layer is analogous to layer 3 of the OSI model. Familiar protocols like IP (Internet Protocol) and ICMP (Internet Control Message Protocol) are a part of this layer.

The third layer of the TCP/IP model is the transport layer, which maps directly to the transport layer of the OSI model. It allows for end-to-end communication of hosts on a network, using the TCP (Transmission Control) and UDP (User Datagram) protocols. A need for the session and presentation layers was not perceived, so the TCP/IP reference model does not contain them explicitly. The fourth layer, the application layer, will contain them if necessary. This layer contains all of the high level protocols such as HTTP, SMTP and DNS.

Although there are certainly similarities between several layers of the two reference models, this thesis uses OSI model terminology, which allows a finer distinction to be made between the network services offered at each layer. The important lower-level protocols for application graphs, however, are those that were originally associated with the TCP/IP model, namely TCP and UDP.

2.3 Layer 3: The Network Layer

The network layer is concerned primarily with delivering packets from one host to another through a series of routers. It attempts to maintain some quality of service for variables such as delay, transit time and jitter while forwarding packets along until the destination is reached.

Figure 2.3: An IP datagram header [2]

Figure 2.3 shows all of the fields contained in the header of an IP data packet. For modeling network communications, however, only two fields are of interest: the source address and the destination address. Each IP address identifies a unique node in an application graph. The protocol field tells the network layer which transport process to give the data to. Two common options are TCP and UDP, described next.

2.4

Layer 4: The Transport Layer

The transport layer is responsible for getting data to and from applications running on the host machine, providing logical end-to-end communication between the applications. There are two types of service available to the upper layers: connectionless and connection-oriented. The simpler of the two is connectionless, implemented by UDP. The delivery and ordering of UDP packets is unreliable, but there is less connection overhead associated with the transfer. Connection-oriented service, provided by TCP, establishes several properties of the transmission ahead of time, such as data window sizes and congestion control mechanisms. TCP packets are given sequence numbers so that they can be kept in order. Although IP networks are still only “best-effort” because no resources are reserved ahead of time, TCP provides reliable communication between hosts.

(a) UDP header

(b) TCP header

Figure 2.4: UDP and TCP datagram headers [2]

TCP and UDP headers (Figure 2.4) contain fields for the source and destination port numbers. Port numbers serve as numerical identifiers for processes. They are 16 bits in length, resulting in $2^{16}$ possible ports, numbered 0 through 65535. The Internet Assigned Numbers Authority (IANA) is responsible for maintaining assignments of port numbers for specific uses [8].

2.5

Layer 7: The Application Layer

The primary objective of this thesis is to identify application usage via communication patterns at the application layer. Although not 100% accurate, port numbers are used as the application labeling scheme for training the application classifier, described in Chapter 5. Some applications communicate on certain port numbers with a high degree of reliability. For example, when a user opens a web browser and requests a web page, a connection is established from a randomly assigned high-numbered port on the user’s computer to port 80 of the web server hosting the page. In this case, Hypertext Transfer Protocol (HTTP) is the layer 7 application protocol used, with the web server listening for connections on port 80, the official IANA port for HTTP. This process is depicted in Figure 2.5.

Source                     Destination
192.168.1.100:29985   →   208.122.19.56:80        User requests a web document
208.122.19.56:80      →   192.168.1.100:29985     Server responds to request

Figure 2.5: Example communication between a client and a web server

There is no shortage of application layer protocols. Common examples include SMTP or POP3 for email services, DNS for domain name resolution, peer-to-peer protocols like BitTorrent and many others. This study focuses on seven applications that reflect a variety of application types and also have official port assignments from the IANA. Protocol selection is detailed in Chapter 4, while the steps taken to create application graphs based on the layer 3, 4 and 7 information are detailed in Section 5.3.
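The port-based labeling idea can be illustrated with a minimal Python sketch. The table `PORT_LABELS` and the function `label_packet` are hypothetical names introduced here for illustration; the table lists only the ports that this thesis names explicitly in Chapters 2 and 4, and the rule of checking the destination port before the source port is an assumption.

```python
# Hypothetical lookup table: only ports named in the text are included.
PORT_LABELS = {80: "HTTP", 53: "DNS", 5190: "AIM", 1214: "Kazaa", 445: "MSDS"}

def label_packet(src_port, dst_port):
    """Return an application label if either port is a known service port."""
    # Assumption: the destination port is checked first, then the source port,
    # so that requests and responses of one exchange receive the same label.
    for port in (dst_port, src_port):
        if port in PORT_LABELS:
            return PORT_LABELS[port]
    return None  # neither port is in the table

# The two packets of Figure 2.5: client request and server response.
request_label = label_packet(29985, 80)
response_label = label_packet(80, 29985)
```

Both packets of the exchange in Figure 2.5 receive the same label, which is what allows the two directions of a conversation to be placed in the same application graph.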


Chapter 3:

Graph Analysis

Graphs are a well-studied concept in mathematics, dating back to Leonhard Euler’s 1736 analysis of the Seven Bridges of Königsberg, which laid many of the foundations of graph theory [21]. Simply put, a graph is a collection of objects with connections between them. These abstract structures model problems in a variety of areas, including logistics, communication systems, biological and chemical compounds and social-group structures [22]. The first part of this chapter reviews the basic concepts and terminology required by the study of application graphs and then introduces several “traditional” measures used to describe graphs. In the latter half of this chapter, network motifs are defined in terms of their graph characteristics and are related to application graphs.

3.1

Graph Terminology and Basic Properties

Unfortunately, some of the mathematical notation used in graph theory tends to differ from text to text. Many of the basic properties and definitions are standard, but for those that are not, this thesis borrows notation primarily from two sources: Chartrand and Zhang [23], and Busacker and Saaty [22]. Abbreviations and function-like syntax replace many Greek letters in this style of notation to avoid confusion. For example, x(G) indicates that x is a property of the entire graph, whereas y(v) indicates that y is a property local to a particular vertex.

Vertices (or nodes, as they are often called in computer science) are the fundamental units in a graph. They can represent any object, such as a person, process, city, or a computer. Vertices are linked together by edges, which show a relationship between the vertices they connect. Some examples include roads connecting cities, social interactions between people, or physical links between computers in a network. A graph is a collection of vertices and edges taken together. Formally, a graph G consists of a finite, non-empty set of vertices V, connected by a set of edges E, written as G = (V, E). This definition implies that a graph must have at least one vertex in it, but it does not necessarily have to contain any edges.

Figure 3.1: A graph with five nodes and five edges

The set of vertices V is written $V = \{v_0, v_1, \ldots, v_k\}$. The cardinality of this set, |V|, is the order, or number of nodes in the graph. A graph’s edge set is defined as $E \subseteq \{\{u, v\} \mid u, v \in V\}$. For brevity, an edge can be written $e_{ij}$ to mean an edge linking node i to node j. |E| is the number of edges in the graph, known as its size. The degree of a node, deg(v), is the number of nodes that v is adjacent to in the graph (those that can be reached by traversing one edge). This set of nodes is known as N(v), the neighborhood of v. In Figure 3.1, nodes 2 and 3 are adjacent to node 1, and N(1) = {2, 3}.
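These definitions map directly onto set operations, as the following Python sketch suggests. The edge list is an assumption: it is one five-node, five-edge graph that is consistent with the values quoted in the text for Figure 3.1, and the actual figure may differ. The names `V`, `E`, `neighborhood` and `deg` mirror the notation above.

```python
# Assumed edge set, chosen to agree with the values quoted for Figure 3.1;
# it is not taken from the figure itself.
V = {1, 2, 3, 4, 5}
E = {frozenset(pair) for pair in [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]}

def neighborhood(v):
    """N(v): all vertices that share an edge with v."""
    return {u for edge in E if v in edge for u in edge if u != v}

def deg(v):
    """deg(v): the number of vertices adjacent to v."""
    return len(neighborhood(v))

order = len(V)  # |V|, the number of nodes
size = len(E)   # |E|, the number of edges
```

Storing each undirected edge as a `frozenset` reflects the fact that the pair {u, v} is unordered.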

3.2

Types of Graphs

Modeling complex systems often requires more detail than just vertices and edges as described in the previous section. One possible approach is to orient the graph to show asymmetric relationships between objects. In an undirected graph, the edges are unordered pairs of vertices, that is, $e_{ij} = e_{ji}$. The edges in a directed graph, however, are ordered pairs, and $e_{ij} \neq e_{ji}$. The degree measure can be extended to directed graphs as the indegree and outdegree of a vertex v: the number of vertices of G from which v is adjacent, and the number of vertices in G to which v is adjacent, respectively. The associated undirected graph of a directed graph is obtained by disregarding the ordering of the end points of each edge.

The assembly line process for building an automobile can be modeled as a directed graph, where each stage of the process is represented by a node in the graph. The directionality of the edges indicates that each step follows in a specified order and that the process cannot happen in the reverse order. Edges of a graph can be weighted, usually with an integer or real number, to imply a “cost” associated with traversing an edge, or to further describe how the edge is used within the overall system. In the auto assembly line graph, an edge weight could represent the amount of time a particular step in the process takes.

3.3

Traditional Graph Measures

Several graph measures exist to describe the structure of a network, such as how connected a vertex is, its distance from other vertices, and how it is positioned in the graph. These measures have been used to characterize many different types of networks and describe their growth patterns [24]. The following sections define the measures selected for this study and provide examples of several of the concepts.

3.3.1

Distances and Path Lengths

The distance between two nodes u and v, written d(u, v), is the length of the shortest path between them. In an unweighted graph, this is equal to the number of edges in the path. In a weighted graph, the length of path P is $\sum_{e \in P} w(e)$. Dijkstra’s algorithm [25] is one common method for determining this path through a network.

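For a vertex v in a connected graph, the eccentricity of v, e(v), is the distance between v and the vertex farthest from v in G. The radius of the graph is $\mathrm{rad}(G) = \min\{e(v) \mid v \in V\}$ and the diameter is $\mathrm{diam}(G) = \max\{e(v) \mid v \in V\}$. A vertex is said to be central if e(v) = rad(G) and periphery if e(v) = diam(G).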

In Figure 3.1, e(1) = 3 because node 5 is the node farthest away from node 1 in the graph and reaching it requires traversing three edges. The radius of the graph is rad(G) = 2 because e(1) = e(2) = e(5) = 3, but e(3) = e(4) = 2. Also, diam(G) = 3, the maximal eccentricity value of all nodes in the graph. According to the definitions above, nodes 3 and 4 are central, while nodes 1, 2 and 5 are said to be periphery nodes.
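The eccentricity, radius and diameter of a small unweighted graph can be computed with a breadth-first search from every vertex. The sketch below is illustrative only: the edge list `EDGES` is an assumed graph that agrees with the eccentricities quoted above (the real Figure 3.1 may differ), and `distances_from` is a hypothetical helper name.

```python
from collections import deque

# Assumed edge list consistent with the eccentricities quoted in the text.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

def distances_from(source):
    """Breadth-first search: d(source, w) for every w reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in ADJ[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# e(v) is the largest distance from v to any other vertex.
ecc = {v: max(distances_from(v).values()) for v in ADJ}
rad = min(ecc.values())
diam = max(ecc.values())
central = sorted(v for v in ADJ if ecc[v] == rad)
periphery = sorted(v for v in ADJ if ecc[v] == diam)
```

For this assumed graph the computed radius, diameter, central nodes and periphery nodes match the values stated in the paragraph above.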

3.3.2

Centrality Measures

It is helpful to describe the centrality measures of a graph in terms of social networks in order to make an analogy: the centrality measures of a vertex indicate how important, prominent, or powerful the vertex is in a graph. The following is a brief examination of four common centrality measures proposed by Freeman and Bonacich [26, 27]. The most basic of these is degree centrality, or $C_D(v)$, defined as $\frac{\deg(v)}{|V| - 1}$. This equation can be modified for directed networks to produce $C_D^{in}$ and $C_D^{out}$. In terms of social network analysis, indegree is interpreted as a measure of popularity, while outdegree is interpreted as gregariousness. In a dense adjacency matrix representation of a graph, the time required to calculate the degree centrality for all nodes is $O(V^2)$.
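As a small illustration of degree centrality on a dense adjacency matrix, the Python sketch below sums each row and divides by |V| − 1. The matrix `A` is an assumed five-vertex undirected example (rows 0 to 4 stand for vertices 1 to 5), and `degree_centrality` is a hypothetical function name.

```python
# Assumed adjacency matrix of a 5-vertex undirected graph.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

def degree_centrality(matrix):
    """deg(v) / (|V| - 1) for every vertex; each row sum visits |V| entries."""
    n = len(matrix)
    return [sum(row) / (n - 1) for row in matrix]

cd = degree_centrality(A)
```

Every one of the |V| rows is scanned in full, which is where the quadratic cost of the dense representation comes from.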


Betweenness centrality is the fraction of shortest paths between all pairs of vertices that pass through a particular vertex v. This measure is given by the equation:

$$C_B(v) = \sum_{\substack{s \neq v \neq t \in V \\ s \neq t}} \frac{\delta_{st}(v)}{\delta_{st}} \tag{3.1}$$

where $\delta_{st}$ is the number of shortest paths from s to t, and $\delta_{st}(v)$ is the number of shortest paths from s to t that pass through v. A vertex with a higher betweenness centrality lies on more shortest paths than a vertex with a lower one. This measure can indicate how “powerful” a vertex is, because it influences the spread of information through a network. $O(V^3)$ calculations are required to determine betweenness and closeness (described next) using the Floyd-Warshall algorithm to find all shortest paths.
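Equation 3.1 can be evaluated directly on a small unweighted graph. The following Python sketch is not the Floyd-Warshall approach mentioned above; it swaps in one breadth-first search per vertex to count shortest paths, which is sufficient when edges are unweighted. The edge list is an assumed example, the helper names are hypothetical, and the sum runs over ordered pairs (s, t), so in an undirected graph each unordered pair contributes twice.

```python
from collections import deque

# Assumed undirected example graph.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

def bfs_counts(s):
    """Return (dist, paths): distances from s and number of shortest paths."""
    dist, paths = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in ADJ[u]:
            if w not in dist:          # first time w is reached
                dist[w] = dist[u] + 1
                paths[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                paths[w] += paths[u]
    return dist, paths

INFO = {s: bfs_counts(s) for s in ADJ}

def betweenness(v):
    """Sum over ordered pairs (s, t) of delta_st(v) / delta_st (Equation 3.1)."""
    dist_v, paths_v = INFO[v]
    total = 0.0
    for s in ADJ:
        dist_s, paths_s = INFO[s]
        for t in ADJ:
            if s == t or v in (s, t) or t not in dist_s:
                continue
            # v lies on a shortest s-t path exactly when the distances add up.
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                total += paths_s[v] * paths_v[t] / paths_s[t]
    return total

cb = {v: betweenness(v) for v in ADJ}
```

In this assumed graph only vertices 3 and 4 lie between other pairs of vertices, so they are the only ones with a nonzero score.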

Closeness centrality is defined as the average shortest path length between a vertex v and all other vertices reachable from it. In network theory it is regarded as a measure of how long it will take information to spread from one vertex to the other reachable vertices in the graph. Closeness centrality is given by:

$$C_C(v) = \frac{\sum_{t \in V} d(v, t)}{n - 1} \tag{3.2}$$

where n ≥ 2 is the number of vertices reachable from v. Because Equation 3.2 is an average distance, vertices in G that have shorter paths to other vertices have a smaller value of $C_C(v)$, indicating that they are closer to the rest of the graph.
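A short Python sketch of Equation 3.2 follows. It assumes that n counts v itself along with the vertices reachable from it, so that n − 1 is the number of other reachable vertices; the edge list and the function names are likewise assumptions made for illustration.

```python
from collections import deque

# Assumed undirected example graph.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

def distances_from(source):
    """Breadth-first search distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in ADJ[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(v):
    """Average shortest path length from v to the other reachable vertices."""
    dist = distances_from(v)
    n = len(dist)  # assumption: n includes v itself, so n - 1 others
    return sum(dist.values()) / (n - 1)

cc = {v: closeness(v) for v in ADJ}
```

Under this definition the most centrally placed vertex receives the smallest value, because the quantity is an average distance.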

The eigenvector centrality is a more sophisticated version of the degree count of a vertex, acknowledging that not all connections within a network are equal. The eigenvector centrality score of a vertex i is proportional to the average of the centrality scores of i’s neighbors. In social networks, this reflects the idea that people connected to influential people are themselves more influential than people connected to less influential people [28]. If the graph is represented as an adjacency matrix A where $A_{ij} = 1$ if node i is connected to node j, and $A_{ij} = 0$ otherwise, eigenvector centrality can be written:

$$x_i = \frac{1}{\lambda} \sum_{j=1}^{|V|} A_{ij} x_j, \tag{3.3}$$

where λ is a constant, $x_i$ is the centrality score of vertex i, and $x_j$ is the centrality score of vertex j. Defining the vector of centralities $x = (x_1, x_2, \ldots)$, the previous equation can be rewritten as

$$\lambda x = A \cdot x \tag{3.4}$$

To force the centralities to be non-negative, it can be shown that λ must be the largest eigenvalue of A, and x the corresponding eigenvector [28].
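One common way to approximate the leading eigenvector in Equation 3.4 is power iteration, sketched below in plain Python. This is an illustrative technique and not necessarily the method used in this thesis. The matrix `A` is an assumed connected, non-bipartite example (which is what lets the iteration settle), and `eigenvector_centrality` and its `iterations` parameter are hypothetical names.

```python
# Assumed adjacency matrix of a connected, non-bipartite 5-vertex graph.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

def eigenvector_centrality(matrix, iterations=200):
    """Power iteration: repeatedly apply A and rescale so that max(x) = 1."""
    n = len(matrix)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(y)               # estimate of the largest eigenvalue
        x = [value / lam for value in y]
    return lam, x

lam, x = eigenvector_centrality(A)
```

After enough iterations the returned pair approximately satisfies λx = A·x, and all entries of x are positive, in line with the non-negativity requirement stated above.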

3.3.3

Clustering Coefficient

The clustering coefficient begins to extract more information about the shape of structures within the graph, whereas many of the previous measures rely on information about paths and path lengths between nodes. The clustering coefficient of v is the number of edges that exist among the neighbors of v, divided by the number of edges that could possibly exist among them. For an undirected graph, the clustering coefficient is defined by the following equation:

$$C(v) = \frac{2\,|\{e_{jk}\}|}{\deg(v)(\deg(v) - 1)} : v_j, v_k \in N(v),\ e_{jk} \in E \tag{3.5}$$

Another way to view clustering is the ratio of triangles (three nodes connected by three edges) to the number of triples (three nodes and two edges, both incident to the same node). A high clustering coefficient indicates that if v1 connects to v2 and v2 connects to v3, then there is a greater chance that v1 and v3 will be connected as well [29, 28].
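Equation 3.5 can be sketched in a few lines of Python. The edge list is an assumed example, `clustering` is a hypothetical function name, and returning 0.0 for vertices with fewer than two neighbors is a convention adopted here because the equation is undefined in that case.

```python
from itertools import combinations

# Assumed undirected example graph containing one triangle (1, 2, 3).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, set()).add(v)
    ADJ.setdefault(v, set()).add(u)

def clustering(v):
    """Edges present among N(v) divided by the edges that could exist."""
    neighbors = sorted(ADJ[v])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbors to connect
    links = sum(1 for a, b in combinations(neighbors, 2) if b in ADJ[a])
    return 2 * links / (k * (k - 1))

coeff = {v: clustering(v) for v in ADJ}
```

Vertices 1 and 2 sit entirely inside the triangle and score 1, while vertex 3 has three neighbors of which only one pair is connected.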

3.3.4

Application of Traditional Graph Measures in Computer Networks

Past studies have looked at graph characteristics for the purpose of anomaly detection and traffic classification. Staniford et al.’s GrIDS system [30] generates graphs describing communications between IP addresses and can generate alerts based on a set of rules, such as a vertex degree count crossing some threshold value. The BLINC traffic profiling system developed by Karagiannis et al. examines the interactions between hosts to identify an application, and utilizes measures including degree counts and neighborhood information [31]. This thesis is similar to the BLINC study in that both evaluate interactions among hosts at the functional and social levels in order to identify applications. The BLINC study, however, exploits additional information such as the transport protocol and average packet size and attempts to match network behavior to a library of empirically derived “graphlets”. In contrast, this study examines a wider variety of graph measures, and also proposes the unique approach of searching application graphs for motifs.

3.4

Network Motifs

A network motif is a pattern of interconnections that occurs in a graph significantly more often than it does in randomized networks. Studies performed by Milo et al. find motifs in several types of complex networks and show that a small number of network motifs occur repeatedly across network types. They describe motifs as fundamental building blocks of networks, capable of defining universal classes of networks [3, 16]. Research suggests that some motifs can be associated with a particular function, as discussed in Section 3.4.2. The work performed in this thesis extends this idea to application graphs to determine whether particular motifs indicate what application protocol a host is using.

3.4.1

Definition of a Motif

In mathematical terms, a graph $G' = (V', E')$ is a subgraph of $G$ if $V' \subseteq V$ and $E' \subseteq E$. A motif, then, is any such subgraph that occurs significantly more often than in random networks. The level of significance required depends on the problem, but as an example Milo et al. consider those patterns with a p-value of 0.01, meaning that there is only a 1% chance of seeing a particular pattern as many or more times in random networks than is observed in the original network [3]. Motif detection is depicted in Figure 3.2.

Figure 3.2: Schematic view of motif detection [3]
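The significance test described above can be approximated empirically by comparing the observed count of a pattern with its counts in an ensemble of randomized networks. The Python sketch below assumes those counts have already been obtained; `empirical_p_value`, `is_motif` and the threshold parameter `alpha` are hypothetical names, and treating a p-value equal to the threshold as significant is an assumption.

```python
def empirical_p_value(observed_count, random_counts):
    """Fraction of randomized networks with at least as many occurrences."""
    hits = sum(1 for count in random_counts if count >= observed_count)
    return hits / len(random_counts)

def is_motif(observed_count, random_counts, alpha=0.01):
    """Assumed rule: a pattern is a motif if its empirical p-value <= alpha."""
    return empirical_p_value(observed_count, random_counts) <= alpha

# Illustrative counts only: 1000 randomized networks, 5 of which
# contain the pattern 12 times and the rest 3 times.
random_counts = [3] * 995 + [12] * 5
p_frequent = empirical_p_value(12, random_counts)
p_common = empirical_p_value(3, random_counts)
```

The numbers in the usage lines are made up for illustration and do not come from any data set in this thesis.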

Generally speaking, motifs of order 3 or larger are considered when performing motif searches. However, searching for large motifs can be prohibitively expensive because of the computational complexity involved. Several algorithms [32, 33] have been developed to increase the efficiency of these searches and allow for the analysis of large networks containing thousands of edges and nodes. Figure 3.3 shows the thirteen possible directed edge combinations for motifs of order 3. In application graphs, the edge directionality indicates the flow of data between two hosts, such as a request from a client to a server, or the response from the server back to the client. Additional motif characteristics are described in Chapter 5.

Figure 3.3: All 13 configurations of order 3 connected subgraphs [3]
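The count of thirteen can be checked by brute force. The Python sketch below enumerates every set of directed edges on three labeled vertices, keeps those whose underlying undirected graph is connected, and groups them by relabeling. All names are hypothetical, and the enumeration is only an illustration, not one of the motif search algorithms cited above.

```python
from itertools import permutations, product

# The six possible directed edges on vertices 0, 1, 2 (no self-loops).
ARCS = [(a, b) for a in range(3) for b in range(3) if a != b]

def weakly_connected(arcs):
    """On three vertices, two distinct linked pairs already span all three."""
    linked_pairs = {frozenset(arc) for arc in arcs}
    return len(linked_pairs) >= 2

def canonical(arcs):
    """Smallest relabeled edge list; equal for isomorphic subgraphs."""
    forms = []
    for perm in permutations(range(3)):
        forms.append(tuple(sorted((perm[a], perm[b]) for a, b in arcs)))
    return min(forms)

classes = set()
for mask in product([0, 1], repeat=len(ARCS)):
    arcs = [arc for arc, keep in zip(ARCS, mask) if keep]
    if weakly_connected(arcs):
        classes.add(canonical(arcs))

num_configurations = len(classes)
```

The same approach becomes impractical quickly as the motif order grows, which is one reason the specialized algorithms mentioned above exist.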

3.4.2

Function of Motifs

Several studies suggest that motifs can be linked to specific functions within a network. Milo et al. analyze the motifs found in the direct transcriptional interactions in Escherichia coli and find three highly significant motifs [16]. Their study states that the appearance of network motifs at high frequencies suggests that they may have some specific functions in the information processing performed by the network.

A different study analyzes the feed-forward loop, or FFL (Figure 3.4). In an FFL, X regulates transcription factor Y, and both jointly regulate gene Z. Mangan et al. show that it acts as a sign-sensitive delay element, in that it responds rapidly to step-like stimuli in one direction (ON to OFF) and with a delay to steps in the opposite direction (OFF to ON). They argue that this type of control mechanism can filter out fluctuations in input stimuli [34].

X → Y
  ↘ ↓
    Z

Figure 3.4: The feed-forward loop (FFL)

3.5

Analysis of Application Graphs

The application graphs studied in this work are hybrid networks, reflecting a mix of social interactions and computer network architectures. Although there are no genes present that require precise regulation like in the biological networks discussed previously, network functions are carried out in a controlled environment that must follow a set of established protocols. For example, if a user wishes to talk to another user on a network via the AIM instant messaging service, each user must first authenticate and establish a connection to a central server; the computers do not simply send text back and forth between the two. Protocol behaviors are described in Chapter 4.

In terms of graph properties, application graphs are modeled with unweighted, directed edges and do not contain any self-loops. If a computer connects to a service running locally, the connection goes over the loopback interface, and is not visible on the network traces examined. The edge direction is set to match the observed traffic flow, which may be either unidirectional or bidirectional. If two computers communicate at any time during a period of monitoring, an edge is drawn between them. Edge weights are not used in this study, but may be considered in the future to provide further detail when determining the application type.
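The edge rules in this paragraph (directed, unweighted, no self-loops, one edge per observed direction) can be expressed as a small Python sketch. The function name `build_application_graph` and the input format of (source IP, destination IP) pairs are assumptions for illustration; the implementation actually used is described in Chapter 5.

```python
def build_application_graph(packets):
    """Build (vertices, edges) from an iterable of (src_ip, dst_ip) pairs."""
    edges = set()
    for src, dst in packets:
        if src != dst:               # application graphs contain no self-loops
            edges.add((src, dst))    # a set keeps each directed edge only once
    vertices = {ip for edge in edges for ip in edge}
    return vertices, edges

# The exchange of Figure 2.5, with the request repeated to show that
# additional packets in the same direction do not add weight.
packets = [
    ("192.168.1.100", "208.122.19.56"),
    ("208.122.19.56", "192.168.1.100"),
    ("192.168.1.100", "208.122.19.56"),
]
vertices, edges = build_application_graph(packets)
```

Traffic seen in both directions produces two directed edges, while traffic seen in only one direction produces a single edge.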

The traditional graph measures defined previously are appropriate for the study of application graphs because of the social aspect of the communications. Application graphs are formed through specific user actions, such as surfing the web, checking email, and sharing music. It is also for this reason that the study of motifs within application graphs is interesting. In systems biology, processes such as gene transcription and regulation are not voluntary tasks; cell survival depends on them. Chapter 5 details the methodology employed to describe application graphs based on their traditional and motif-based characteristics.


Chapter 4:

Data Selection and Considerations

As is the case in any type of research, proper data selection is imperative for producing accurate results and analysis. This chapter examines several of the issues involved with the collection and sampling of computer network data in an effort to build a baseline measure for “normal” network behavior, and concludes with an overview of the seven application protocols selected for this study.

4.1

Network Trace Files

The pcap library provides the packet-capture and filtering engines of several popular network analysis and monitoring tools [35]. Some examples include tcpdump, nmap, Wireshark and the Snort IDS. Tcpdump in particular is a valuable tool for capturing packets as they come across a network interface card, a process known as “sniffing”, and logging them in a raw format which can then be analyzed by other tools as shown in Figure 4.1. Although tcpdump is able to capture all of the data associated with each network packet such as packet length, flags and checksum values, only a few of the fields specified by the IP, TCP, and UDP RFC documents [36, 37, 38] are needed to model application graphs: source IP, destination IP, source port and destination port. These four pieces of information are enough to uniquely identify a process running over a computer network between two hosts. The creation of application graphs is discussed in Chapter 3 and the implementation is detailed in Chapter 5.

4.2

Challenges Associated with Network Data Collection

Pang et al. identify three key goals of sharing network data with other researchers: verification of previous research, direct comparison of competing ideas on the same data, and a broader view than a single investigator can obtain on their own [39]. Unfortunately, there are several concerns that must be addressed, such as the amount of data collected, the accuracy of the data and protection of users’ privacy. This section outlines a few of these issues.

Figure 4.1: Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags

4.2.1

Data Capture

Increased utilization and line speeds of today’s high speed, high capacity networks present challenges for collecting network data in terms of data rate, storage and processing [40]. A packet sniffer can easily log hundreds of gigabytes of data in a single day, even on a moderately sized network. A study of traffic collected at Dartmouth College shows a significant increase in peer-to-peer, streaming multimedia and VoIP traffic, whereas initial network usage was dominated by web traffic [41]. Both static and streaming multimedia applications require significantly more bandwidth than simple web documents or other non-interactive file types. Research characterizing YouTube™ traffic found that 90% of videos requested by University of Calgary campus network users were larger than 21.9 MB [42], orders of magnitude larger than the file sizes of other content types.

In addition to requiring a great deal of storage space, high speed packet capturing also requires fast memory access and high disk speed so that packets can be written to the disk before the capture buffer fills and packets are lost. Although undesirable, this behavior does not affect the study of application graphs proposed here, which uses individual packets to establish a communication link instead of aggregated flows (all packets associated with a particular origin and destination pair). Two nodes in an application graph will be connected if any packets are sent between them, regardless of whether those packets come from the beginning, middle, or end of the flow. Therefore, partial flows are considered in these graphs.

Another advantage of using individual packets is that TCP and UDP sessions do not need to be defined. TCP connections are established by a three-way handshake between the client and server, and are terminated by a FIN and FIN-ACK sequence. The formal establishment or tear-down of a TCP session might not be correctly logged for several reasons: the sniffer could be turned on or off in the middle of the session, parts of the handshake could be dropped by the sniffer, or either the client or server could disconnect without following the closing protocol. UDP does not establish formal sessions like TCP does, so UDP flows are sometimes segregated by establishing a timeout value after which the flow is considered terminated if there is no activity. The edges in an application graph are binary in nature and only indicate whether or not host A communicated with host B.

4.2.2

Privacy and Sanitization of Data

Monitoring network traffic may raise serious privacy concerns, as data sent in cleartext (i.e. not encrypted) is easily read by sniffing. Data such as usernames and passwords sent to web sites via the HTTP protocol instead of the encrypted HTTPS protocol can be effortlessly obtained by an attacker on the network. Even if sensitive information is not being sent, an attacker can log all text and images downloaded by a user as he or she surfs the web, and reassemble the browser sessions later.

Researchers who collect and share network data must not only sanitize the resulting log files to ensure the privacy of users, but they must also disguise the IP addresses of machines on the network so that an attacker does not have a map of the network with which to launch an attack. Several methods and tools have been developed to accomplish these tasks, such as [43, 39, 44, 45].

Often times a network administrator or developer does not need to log the packet payloads to perform tasks such as verifying routes or debugging programs that utilize sockets. If this is the case, only the packet headers are logged and the rest of the packet is discarded. Only storing packet headers also helps alleviate the issue of storage space discussed in the previous section. This shortcut cannot be used in the case of signature-based intrusion detection systems which rely on scanning the payload of a packet for known signatures that indicate an attack. The methods proposed in this thesis do not consider packet payloads, but only the information readily available in the packet header.

4.2.3

Network and Data View

Ideally, a “God’s eye view” of a computer network would reveal all communication links within the network as well as connections from within the network to other networks outside of it. Unfortunately, many sniffers are placed at gateway nodes at the edge of a network and only capture traffic leaving from and coming to the network. As a result, traffic originating from within a network and destined for internal servers (web and application servers, email servers, etc.) is not logged because it never reaches the gateway. Some data collection projects such as [46] attempt to address the lack of internal enterprise network traffic that is available for research.

One drawback of the research method proposed in this paper is that it currently assumes network activity for a particular application is limited to a single port. This is not true for out-of-band protocols such as FTP that send authentication and control messages over one port but use another for data transfer. Even if provided with a complete view of the network, the data is segmented into individual port numbers for analysis. Therefore, a process that communicates over multiple port numbers will not be visible in its entirety. If a client connects to a web server on port 80, and that web server requests data from a MySQL™ server (default port 3306) or an IBM WebSphere® Application server (default port 2809), only one part of the process is visible at a time, either client to web server, web server to database, or web server to application server. Seeing all components of a particular process would reveal interesting structural motifs, but the motif and node properties examined in isolation still hint at the function of the nodes. Possible techniques for aggregating data for different views are discussed in Chapter 7.
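The effect of segmenting traffic by port can be seen in the following Python sketch, which splits a list of packets into one edge set per port. The function `graphs_by_port`, the four-tuple packet format and the private IP addresses are all hypothetical and chosen only to mirror the web server and database example above.

```python
from collections import defaultdict

def graphs_by_port(packets, ports_of_interest):
    """Group directed edges by service port.

    packets: iterable of (src_ip, src_port, dst_ip, dst_port) tuples.
    """
    graphs = defaultdict(set)
    for src_ip, src_port, dst_ip, dst_port in packets:
        for port in (src_port, dst_port):
            if port in ports_of_interest:
                graphs[port].add((src_ip, dst_ip))
    return dict(graphs)

# Hypothetical addresses: a client, a web server and a database server.
packets = [
    ("10.0.0.5", 41000, "10.0.0.10", 80),     # client to web server
    ("10.0.0.10", 52000, "10.0.0.20", 3306),  # web server to database
]
graphs = graphs_by_port(packets, {80, 3306})
```

The two hops of the same user request land in two separate graphs, so the three-node chain from client to database never appears in either one.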

4.3

Data Sources

The data sets used in this study come from three different sources in an attempt to show measurable differences in protocols and behavior, even across networks with different underlying architectures and usage patterns. One data set often used in intrusion detection research is the 1998 & 1999 DARPA Intrusion Detection Evaluation Data Set [47]. The primary reason this data was not selected, however, is its age; as Henderson et al. point out, the type of traffic seen in computer networks has changed [41]. This is not to imply that the approach described in this thesis would not work with older data, but that newer network traces containing a wider variety of application use might prove more interesting to examine. Additionally, traffic for the DARPA initiative is synthetic, whereas the data sets described in this section contain real network data that reflects current trends in network and protocol use. Table 4.1 provides overview statistics for the traces examined.


4.3.1

Dartmouth College Wireless Traces

The CRAWDAD project at Dartmouth College provides an archive of wireless network data from several contributors around the globe. Included in the archive is 163 GB of packet headers captured from eighteen buildings on the campus during the Fall 2003 semester [48]. Data collected is representative of traffic in residential buildings, academic buildings, as well as the library. It has been sanitized in such a way that the IP addresses are consistent across traces, allowing for a more complete picture of network use. The campus wireless network contains several thousand users and over 450 wireless access points.

4.3.2

LBNL/ICSI Enterprise Tracing Program

The ICSI Enterprise Tracing Program hopes to provide a view into the internal traffic for an entire enterprise site [46]. These traces, taken from the Lawrence Berkeley National Laboratory (LBNL) in 2004 and 2005, span more than 100 hours of activity and include traffic from several thousand internal hosts. The data is sanitized in accordance with the methodologies described in [39]. Like the Dartmouth wireless traces, only packet headers were captured and the payload discarded.

                            Dartmouth          LBNL          OSDI
Capture length (seconds)    21818.575       600.079       193.348
Number of packets             2023527       2261261        324116
Avg. packets/sec               92.743      3768.274      1676.335
Number of bytes            1092602793     778659304      94814149
Avg. bytes/sec              50076.726   1297595.353    490380.757

Table 4.1: Summary statistics of three trace files examined
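The average rates in Table 4.1 follow from dividing the packet and byte totals by the capture length. The short Python sketch below recomputes them from the tabulated totals; the dictionary name `TRACES` and its layout are assumptions for illustration. The recomputed values agree closely with the tabulated averages (relative difference well under 0.001%).

```python
# (capture length in seconds, number of packets, number of bytes) from Table 4.1
TRACES = {
    "Dartmouth": (21818.575, 2023527, 1092602793),
    "LBNL": (600.079, 2261261, 778659304),
    "OSDI": (193.348, 324116, 94814149),
}

# Average packets per second and bytes per second for each trace.
rates = {
    name: (packets / seconds, nbytes / seconds)
    for name, (seconds, packets, nbytes) in TRACES.items()
}
```

The LBNL and OSDI traces are far shorter than the Dartmouth trace but carry one to two orders of magnitude more packets per second.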

4.3.3

OSDI Conference Network Traces

The last source of data used for analysis in this paper also comes from the CRAWDAD archive, and includes traces from ten sniffers at the 2006 Operating Systems Design and Implementation (OSDI) Conference [49]. Researchers collected this data to enable the analysis of the behavior of a heavily used wireless LAN. The data was initially sanitized on-the-fly and then reprocessed off-line to further obfuscate the MAC addresses as necessary. Although this data set does not have the “enterprise” characteristics of the previous two, its inclusion helps to determine the generalizability of the methods proposed in this work to different networks and network points of view.

4.4

Protocol Selection

Several criteria were used to select the protocols examined in this paper, including availability, popularity and diversity. First and foremost, there must be enough data samples of a particular protocol in the trace files to be able to perform the graph characteristic and motif analysis. To achieve this goal, more well-known and widely used protocols were chosen. Also, protocols that have different architectures (client-server vs. peer-to-peer, for example) were selected in order to highlight the differences in node characteristics. Because packet payloads are not inspected, applications that operate on official IANA port numbers and are in-band protocols are used so that reasonable assumptions can be made about the data and so that the port numbers accurately reflect the protocol being used. As a reminder, the port number is not used to classify applications, but is only used to provide class labels.

AOL Instant Messenger (AIM)

AOL’s instant messaging client has been a popular application for users around the world for over a decade. AIM uses a proprietary protocol called OSCAR to communicate with other clients [50]. Multicast routing architectures exist and are used by some chat programs such as IRC, but all AIM connections go through a centralized server. Users authenticate to the AIM login server on port 5190. Once the user’s session has been established, all chat communications also go through central AIM servers on port 5190. The exception to this is when a user attempts to establish a direct connection to another user (such as when sending pictures or other files), in which case the communication goes directly to the other user and bypasses the central AIM servers. Therefore AIM is primarily a client-server application, with some peer-to-peer capabilities as well. This study restricts itself to communications on port 5190, so any direct file transfers are ignored.

HyperText Transfer Protocol (HTTP)

HTTP is used to retrieve hyperlinked text documents from the World Wide Web [51]. A client initiates an HTTP request by connecting to a web server, typically on port 80. The web server then responds with a status line, followed by a message containing the requested content, such as an HTML file or an image. HTTP is a stateless protocol, which means no information is retained between requests. This protocol falls directly into the client-server architecture model.

Domain Name System (DNS)

DNS is a hierarchical naming system that maps meaningful domain names to numerical IP addresses [52]. If a DNS server does not know the correct mapping for a given domain, it can tell the client-side DNS resolver where to query next in an attempt to resolve the address. DNS primarily communicates via UDP on port 53 and also follows a client-server architecture. Its hierarchical nature, however, makes it an interesting selection for analysis.


Kazaa

Kazaa is a peer-to-peer file sharing application built on the FastTrack protocol that operates on port 1214. This protocol uses supernodes for scalability. A supernode is any node on the network that also acts as a proxy and relayer for the network, handling data flow and connections for other users. A peer-to-peer network should be more highly connected than a client-server network, since all nodes in the network act as both clients and servers for each other.
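The connectivity expectation above can be illustrated with a toy comparison. The following is a minimal sketch, assuming graph density (the fraction of possible node pairs joined by an edge) as the measure of connectedness; the two topologies and the function names are illustrative assumptions, not graphs or code from this study.

```python
def density(num_nodes, edges):
    """Fraction of possible undirected node pairs that are joined by an edge."""
    # Collapse duplicate and reversed edges, and ignore self-loops.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    possible = num_nodes * (num_nodes - 1) // 2
    return len(unique) / possible if possible else 0.0


def star_edges(n):
    """Client-server toy graph: node 0 is the server, nodes 1..n-1 are clients."""
    return [(0, i) for i in range(1, n)]


def peer_edges(n, k):
    """Peer-to-peer toy graph: each node connects to the next k nodes on a ring."""
    return [(i, (i + j) % n) for i in range(n) for j in range(1, k + 1)]


if __name__ == "__main__":
    print("client-server density:", density(6, star_edges(6)))
    print("peer-to-peer density:", density(6, peer_edges(6, 2)))
```

In this toy setting the star-shaped client-server graph is sparser than the peer graph of the same size. Real traces need not be this clean, particularly when supernodes concentrate connections.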

Microsoft Active Directory (MSDS)

Microsoft Active Directory is a client-server protocol that provides a way to manage objects and relationships across a network. Objects can be resources such as printers, services such as email, or users (accounts and groups). It provides several services such as DNS-based naming, authentication methods and LDAP-like directory services. Active Directory Domain Services (MSDS) is the central location for configuration information, authentication requests and information about network objects [53]. It operates on port 445. Windows shares and Active Directory are commonly used in Windows-based networks, and their inclusion for analysis provides an example of platform-dependent network traffic.

NetBIOS Name Service

NetBIOS (Network Basic Input/Output System) is used to allow applications on separate computers to communicate over a local area network. It provides three main services: (i) name service for name registration and resolution, (ii) session service for connection-oriented communication, and (iii) datagram distribution service for connectionless communication. The name service communicates over port 137 with either the TCP or UDP protocol. A computer, which has a unique host name, might have multiple NetBIOS names. The inclusion of NetBIOS for analysis is interesting because it often receives port scans and is frequently the target of malicious attacks. The architecture of NetBIOS communications is somewhat unusual in that it does not fall cleanly into a client-server model, nor does it fit the P2P model. It will occasionally use broadcast messages, and NetBIOS hosts can also be configured as peers.

Secure Shell (SSH)

Secure Shell is a protocol that allows encrypted data to be sent between two computers on a network. It is often used for remote administration of other computers, creating secure tunnels for web browsing, and securely copying files. SSH is primarily used in UNIX/Linux environments and runs on port 22. SSH utilizes a client-server architecture.
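The port assignments given in this section can be collected into a small lookup table for producing class labels. The following is a minimal sketch: the port numbers come from the protocol descriptions above, while the names `PROTOCOL_PORTS` and `class_label`, and the choice to check the destination port before the source port, are assumptions made for illustration.

```python
# Port numbers as stated in the protocol descriptions of this section.
PROTOCOL_PORTS = {
    "AIM": 5190,
    "HTTP": 80,
    "DNS": 53,
    "Kazaa": 1214,
    "MSDS": 445,
    "NetBIOS": 137,
    "SSH": 22,
}

# Reverse mapping used to look up a label from a port number.
PORT_TO_LABEL = {port: name for name, port in PROTOCOL_PORTS.items()}


def class_label(src_port, dst_port):
    """Return the protocol label for a packet, or None if neither port matches.

    Assumption: the destination port is checked first. The text does not say
    how a packet whose two ports match different protocols would be labeled.
    """
    for port in (dst_port, src_port):
        if port in PORT_TO_LABEL:
            return PORT_TO_LABEL[port]
    return None
```

The label is used only as ground truth for evaluating a classifier; it is never supplied as an input feature.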


Chapter 5: Experimental Methodology

The analysis of application graphs involves several stages and requires the use of many different software tools. The major tasks include parsing and storing network data, creating graphs and vertex profiles, analyzing node properties, searching for motifs, and creating a classifier to predict application labels. Optimization of the classification process via feature weighting is also considered. This chapter describes the process as well as the tools used, which are open-source and freely available. For the reader’s convenience, a summary diagram is given in Figure 5.1.

[Diagram. Process stages: parse and store data; construct application graphs; traditional profile creation and analysis; motif-based profile creation and analysis; nearest neighbor classification; evolutionary attribute weighting. Tools: Wireshark, Afterglow, Python, NetworkX, FANMOD, RapidMiner.]

Figure 5.1: Overview of the proposed methodology and tools used

5.1 Hardware and Linux System

All tests were run on a multi-core system running the Linux kernel version 2.6.22. The system contains four dual-core AMD 64-bit processors running at 1.8 GHz each. It uses a shared-memory architecture with 8 GB of memory. Although most of the tools are not written to take advantage of multiple cores, the hardware architecture allows for analysis of multiple network traces to happen simultaneously.


5.2 Packet Capture and Storage

The network traces are in the pcap format as described in Chapter 4. Modified parsers based on those distributed as part of The Afterglow Project [54] were used to parse pcap files. Additionally, Wireshark [10] was used to extract basic information from the network trace files, including the source IP, destination IP, source port, destination port, timestamp, protocol and packet length.

tshark -t e -r input.pcap tcp or udp | python tshark2mysql.py t

Figure 5.2: Storing packets from a pcap file into a MySQL database

Once the packets have been parsed, they are stored in a MySQL database for later retrieval. This facilitates later steps in the process, allowing packets to be selected based on their source or destination port numbers, protocol type, or other attributes. Figure 5.2 illustrates the process of parsing and storing information from input.pcap into a MySQL database table t. Each network trace file is stored in a unique table within the database.
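The parse-and-store step might look roughly like the sketch below. It is not the tshark2mysql.py script, which is not reproduced here. It makes two loudly stated assumptions: each input line carries the seven attributes listed above as tab-separated fields (timestamp, source IP, destination IP, source port, destination port, protocol, length), and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained. The schema and the function names are hypothetical.

```python
import sqlite3


def parse_line(line):
    """Split one tab-separated packet record into a typed tuple.

    Assumed field order: timestamp, source IP, destination IP, source port,
    destination port, protocol, packet length.
    """
    ts, src_ip, dst_ip, src_port, dst_port, protocol, length = (
        line.rstrip("\n").split("\t")
    )
    return (float(ts), src_ip, dst_ip, int(src_port), int(dst_port),
            protocol, int(length))


def store_packets(conn, table, lines):
    """Create the per-trace table if needed and insert every parsed record."""
    # Table names cannot be bound as SQL parameters, so validate before use.
    if not table.isidentifier():
        raise ValueError("unsafe table name: %r" % table)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (ts REAL, src_ip TEXT, dst_ip TEXT, "
        "src_port INTEGER, dst_port INTEGER, protocol TEXT, length INTEGER)"
        % table
    )
    conn.executemany(
        "INSERT INTO %s VALUES (?, ?, ?, ?, ?, ?, ?)" % table,
        (parse_line(line) for line in lines if line.strip()),
    )
    conn.commit()


def packets_on_port(conn, table, port):
    """Return rows whose source or destination port equals the given port."""
    if not table.isidentifier():
        raise ValueError("unsafe table name: %r" % table)
    query = ("SELECT * FROM %s WHERE src_port = ? OR dst_port = ? ORDER BY ts"
             % table)
    return conn.execute(query, (port, port)).fetchall()
```

Keeping one table per trace, as described above, means a later query such as `packets_on_port(conn, "t", 80)` touches only the packets of a single trace file.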

5.3 Creation of Application Graphs

The next step in the process is to model the application graphs and analyze the traditional measures as described in the first half of Chapter 3. NetworkX is a package for the creation, manipulation and study of complex networks, written in the Python programming language [55]. Graphs are created by querying a MySQL database table for all entries for which either the source or destination port number matches the port number of one of the seven application protocols. Although port numbers do not always accurately reflect the application bound to them, they are generally a strong indicator, especially for the well-known port numbers 0-1023 (e.g., HTTP servers typically listen on port 80 for connections). For the purposes of this work, the
