Flow Classification Using Clustering And Associative Rule Mining

(1)

ABSTRACT

CHAUDHARY, UMANG KAMALAKAR. Flow Classification Using Clustering And Associative Rule Mining. (Under the direction of Dr. Michael Devetsikiotis.)

Traffic classification has become a crucial domain of research due to rise in applications that

are either encrypted or tend to change port consecutively. The challenge of flow classification is to determine the application without any information of the payload. In this paper, our goal

is to achieve a robust and reliable flow classification using data mining techniques. We propose

a classification model which not only classifies flow traffic, but also performs behavior pattern profiling. The classification is implemented using clustering algorithms, and association rules

are derived by using Apriori algorithms. We are able to find association between flow

parame-ters for various applications, therefore making the algorithm independent of the characterized applications. The rule mining helps us to depict various behavior patterns for an application,

(2)

Flow Classification Using Clustering And Associative Rule Mining

by

Umang Kamalakar Chaudhary

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Master of Science

Computer Engineering

Raleigh, North Carolina

2010

APPROVED BY:

Dr. Yannis Viniotis Dr. Do Young Eun

(3)

DEDICATION

(4)

BIOGRAPHY

Umang Chaudhary received his Bachelor of Engineering degree from University of Mumbai,

India. He is currently pursing Master of Sciences in Computer Engineering in Network Traffic Classification in North Carolina State University in Raleigh. He worked as research associate

for Network Performance Research Group for accurate classification of NetFlow traffic using

(5)

ACKNOWLEDGEMENTS

I would like to thank my parents, Kamalakar and Lata, for their support and commitment in

helping me succeed. Also, to my sister, Bhavna, who constantly reminds me that life teaches you more than school work.

I am grateful for the advice and guidance of my advisor, Dr. Michael Devetsikiotis, and for

that of my committee members, Dr. Yannis Viniotis and Dr. Do Young Eun. Also, Ioannis Papapanagiotou, who consistently supported me and shared the vision of this research work.

A special thanks to all of the members of the ITng Services at Centennial Campus who gave

(6)

TABLE OF CONTENTS

List of Tables . . . vii

List of Figures . . . .viii

Chapter 1 Introduction . . . 1

1.1 Motive and Goals . . . 1

1.2 Contribution . . . 2

1.3 Thesis Organisation . . . 3

Chapter 2 Background and Related Work . . . 4

2.1 Packet Classification and NetFlow Classification . . . 4

2.2 Related Work . . . 7

2.2.1 Port-based Classification . . . 7

2.2.2 Payload-based Classification . . . 8

2.2.3 Transport-layer based classification . . . 9

2.2.4 Machine Learning . . . 10

2.2.5 Graph . . . 12

2.2.6 Statistical Classification . . . 13

Chapter 3 Classification Model . . . 14

3.1 Classification Model . . . 14

3.1.1 Processing and Flow Data . . . 14

3.1.2 Clustering . . . 16

3.1.3 Transductive Classification . . . 19

3.1.4 Association-Rule Classification . . . 19

Chapter 4 Analysis and Results . . . 22

4.1 Traces . . . 22

4.2 Classification Metrics . . . 22

4.3 Analysis and Results . . . 24

4.3.1 K-Mean Clustering . . . 24

4.3.2 Model-based Clustering . . . 27

4.3.3 Rule based Classification . . . 30

4.4 Cluster Analysis . . . 33

4.5 Behavior Pattern for Application . . . 34

Chapter 5 Conclusion . . . 36

5.1 Summary of Results . . . 36

5.2 Future Work . . . 37

(7)

Appendix . . . 40

Appendix A Application Behavior Profiling . . . 41

A.1 DNS behavior Pattern . . . 41

A.2 Mail behavior Pattern . . . 42

A.3 HTTP behavior Pattern . . . 43

A.4 IRC behavior Pattern . . . 44

(8)

LIST OF TABLES

Table 2.1 NetFlow version 5 format . . . 6

Table 2.2 Comparison between payload and NetFlow . . . 7

Table 3.1 Distribution of application class in traces . . . 15

Table 3.2 Parameterization of Covariance Matrix in Mclust Package . . . 18

Table 4.1 Evaluation Metrics for Machine Learning Classification Techniques . . . . 23

Table 4.2 Precision & Recall for TranductiveK-Mean and Model-based classification 30 Table 4.3 Accuracy for Association Rule classifier based on K-Mean and Model-based clustering for WAND and CRAWDAD Traces . . . 32

(9)

LIST OF FIGURES

Figure 2.1 Various Network Traffic Classification Methods . . . 8

Figure 2.2 Application behavior patterns based on transport layer attributes . . . 9

Figure 2.3 Application behavior patterns based on transport layer attributes . . . 10

Figure 2.4 Graph-based classification for HTTP-TG application . . . 12

Figure 3.1 Classification Model . . . 15

Figure 4.1 K-Mean Transductive Classifier Precision Graph for WAND Traces . . . . 25

Figure 4.2 K-Mean Transductive Classifier Recall Graph for WAND Traces . . . 25

Figure 4.3 K-Mean Transductive Classifier Accuracy Graph for WAND Traces . . . 26

Figure 4.4 Bayesian Information Criterion distribution for parameterization . . . 27

Figure 4.5 Model-based Transductive Classifier Precision Graph for WAND Traces . 28 Figure 4.6 Model-based Transductive Classifier Recall Graph for WAND Traces . . . 29

Figure 4.7 K-Mean and Rule-based Classification Accuracy Graph for WAND Traces 31 Figure 4.8 Model-based clustering and Rule based classification Accuracy for WAND and Crawdad traces . . . 32

(10)

Chapter 1

Introduction

New applications create a vast diversity of Internet traffic. As the Internet continues to expand, optimum provisioning and managing has become a severe challenge [21]. These applications

are finding innovative ways to tensely use network resources.

1.1 Motive and Goals

In recent times, the network operators need a methodology in which applications can be

pro-filed and classified by taking into account their recent behavior patterns. Traffic classification

and analysis will help Internet Service Providers (ISPs) to plan resources for applications and optimize their business goals [17]. Moreover, pattern recognition and behavior profiling has

become a necessity in the discovery of malicious traffic, and Denial of Service attacks [20].

A lot of research has been done in the field of network traffic classification using various techniques and models. The classification can be done using payload attributes, transport layer

information, application behavior, statistical data modeling, and machine learning techniques. Packet-based techniques uses session information, content signatures and behavior

signa-tures for accurate classification. The accuracy rate is usually very high even for peer-to-peer

applications [15]. Packet identification faces limitations due to encryption, string search opti-mization, storage and processing overhead to support massive amounts of packet analysis.

Flow based techniques use header information and connection pattern between source and

destination IPs to deduce classification models. Though due to the non-existence of the payload information, a classifier based on netflow traces have been long regarded as a challenging process.

In this work, we applied clustering and data mining techniques to overcome this issue. In the

following a brief explanation of the classification processes is given.

Port-based classificationis a conventional method to classify traffic in the network by

(11)

is assigned by Internet Assigned Numbers Authority (IANA) [9]. Due to dynamic port

allo-cation and use of unregistered ports, port based classifiallo-cation has reduced effectiveness for peer-to-peer applications.

Payload-based classification has packets processed to match particular signatures of the

application of interest. The approach needs to remain up to date with signatures of the ever evolving applications. There is need of vigorous regular expression matching and string search

optimization for identifying unique signature values of various applications [15].

Transport-layer heuristics analyze transport layer parameters to interpret flow connection patterns of various network applications. The method overcomes processing and memory

draw-back of payload-based analysis [11]. In [12], the host related characteristics are utilized for

classification on social, functional and application level.

Statistical-based classification utilize statistical properties to build empirical models based

on parameters such as byte size, flow/packet duration, inter-arrival packet timing. Distinct

traffic flow statistics properties have been observed for network applications [4].

Graph-based Classification combines network-wide behavior and flow-level characteristics

using Traffic Dispersion Graphs (TDG). This technique provides almost 90% accuracy for

peer-to-peer traffic with backbone traces achieving 95% of precision [10].

Machine Learning based Classification utilizes powerful data mining, information heuristics

techniques for describing structural patterns in data for classification purpose. The Machine Learning Model computes patterns from training datasets. These patterns form basis for

classifi-cation of the test dataset [16]. One of the main drawbacks of the above classificlassifi-cation techniques

is that they cannot profile applications with changing behaviors, as well as there is a need constant profiler updates for any future application

1.2 Contribution

In the proposed classification model, we combine traffic clustering and application behavior profiling. Such a technique increases the classification accuracy and is independent of the format

of the applications. This is achieved by creating association rules between flows that belong

to the same applications. Association rules are calculated without any human intervention therefore making the algorithm independent of the investigated application.

Due to the need for an efficient and practical approach, we developed a two step netflow

classification algorithm. The proposed process, first utilizes a clustering algorithm to cluster and classify the data, and secondly implements association rule mining technique for labeling

flow datasets. More specifically, our contribution can be summarized as follows:

(12)

• We are able to profile behavior patterns for each network applications using Apriori as-sociation rule algorithm. With rule evaluation parameters such as lift, confidence and support, we are able to model and characterize interaction of the flow attributes.

• Due to creation of the association rules, the process isindependent of the dataset. There-fore, for the unknown or future applications that algorithm instantly defines each pattern.

1.3 Thesis Organisation

In the next chapter, we describe various network traffic classification techniques. In the third

chapter, we propose our classification model using Apriori algorithm and clustering. In the

(13)

Chapter 2

Background and Related Work

This chapter provides an overview of the related work in the network traffic classification, addresses payload and flow classification, gives a brief review of the types of classification

models, algorithms and techniques.

2.1 Packet Classification and NetFlow Classification

Packet Classification

Packet contain valuable information of the serviced application in the network. Due to

spec-ification of the application protocols, there are particular signalling information or messages

necessary to initial communication between hosts. Thus, there are certain string patterns or signatures present in the packet that map back to the application.

Packet based classification searches for characteristics signatures of applications in the

packet payload. The classification model utilize stateful reconstruction of session and appli-cation data strings from each packet’s content. There is an advantage that duplicate data

can be identified but there is a need to reassemble packets in real time. In the peer-to-peer

applications, the client payload contains signaling and peer information which form basis to application signatures. Moreover, the location and length of signatures in the packets plays a

vital role for classification. This method provided low false negative value for frequently used

peer-to-peer applications [22].

Although payload based classification achieves high accuracy, the pattern matching in the

packet payload is quite CPU intensive. There are large requirements of processing power and

memory allocation for huge packet traces. As the applications evolves, the signalling behavior changes and hence signatures needs to be update with current application patterns. In addition,

there are various implementation of a particular application which do not comply with avaliable

(14)

NetFlow Classification

Flow based classification utilizes packet header information between source and destination

hosts. Cisco’s NetFlow collects valuable flow statistics and has become widely used in traffic

monitoring tools. According to NetFlow version 5,flow is defined as a unidirectional sequence of packets between a particular pair of source and destination IP addresses.

A new flow is decided when any one of the following key fields change,

• Source IP address

• Destination IP address

• Source port number

• Destination port number

• Layer 3 protocol type

• ToS byte

• Input logical interface (ifIndex)

NetFlow are advantages for customer applications due to the following reasons [3]:

• Network Monitoring The real time network monitoring capability allows analysis of traffic patterns in individual network devices such as routers and switches as well as on a network

wide basis. This assists in troubleshooting and provide rapid solutions.

• Application Monitoring and Profiling NetFlow helps view detailed time based application usage over the network. The information can be used for planning new services and

applications. Moreover, network and application resources can be allocated according to customer base.

• User Monitoring and Profiling On the user side, NetFlow provides detailed information of user utilization of network and application resources. This assists in planning user

network, allocate access, network and backbone resources as well as avoid security threats.

• Network Planning NetFlow has provisions for collecting data for a long period of time for understanding the need of growth and upgradation in the network. NetFlow minimizes

cost by maximizing performance, capacity and reliability. Valuable information is gained

(15)

Table 2.1: NetFlow version 5 format

Bytes Contents Description

0-3 srcaddr Source IP address

4-7 dstaddr Destination IP address

8-11 nexthop IP address of next hop router

12-13 input SNMP index of input interface

14-15 output SNMP index of output interface

16-19 dPkts Packets in the flow

20-23 dOctets Total number of Layer 3 bytes in the packets of the flow

24-27 First SysUptime at start of flow

28-31 Last SysUptime at the time the last packet of the flow was received 32-33 srcport TCP/UDP source port number or equivalent

34-35 dstport TCP/UDP destination port number or equivalent

36 pad1 Unused (zero) bytes

37 tcp flags Cumulative OR of TCP flags

38 prot IP protocol type (for example, TCP = 6; UDP = 17)

39 tos IP type of service (ToS)

40-41 src as Autonomous system number of the source, either origin or peer 42-43 dst as Autonomous system number of the destination, either origin or peer

44 src mask Source address prefix mask bits 45 dst mask Destination address prefix mask bits

46-47 pad2 Unused (zero) bytes

• NetFlow Data Ware Housing and Data Mining NetFlow data can be hub for retrieval and analysis in support of applications according to user behavior. In addition, the Market Researchers access gain crucial information of ”who”, ”what”, ”where” and ”how long”

information relevant to the service providers.

Table-2.1 is reproduced from [3] for illustration of NetFlow Version 5 packet format. NetFlow

flows records are forwarded to the collector in form of UDP/IP packets. It can be heavy in computation for high speed network enivornment [3]. NetFlow features provide considerable

revelance for IP traffic under test. From symmetic uncertainity of all features, source and

destination IPs, source and destination ports, and TCP flags show high relevance. The reason is that IP address consist of social behavior of various applications such as HTTP servers, Mail

servers [24]. Hence, in our analysis, we consider high relevance flow attributes - source IP,

(16)

Table 2.2: Comparison between payload and NetFlow Attributes Payload NetFlow Computation Cost High Low

Storage Overhead High Low Encryption Fails Success Privacy Issues Diffcult Easy

Packet vs NetFlow

We provide an evaluation of advantages and disadvantages of payload and NetFlow methods as follows:

Processing at network speed for payload methods is high due to system memory bus

bot-tlenecks. With increase in volume of packets, there is risk to packet loss at high utilizations of a monitored link. In NetFlow, header information of all packets communicated between two

hosts accurately depicts IP traffic characteristics.

Storage requirements increases significantly for payload methods if packet size increases. On the other hand, flow sampling could reduce storage overhead.

Encryption causes signature matching in payload methods to fail. Alternatively, header

information remains unencrypted hence flow statistics can be computed.

Privacy Issues, due to RIAA litigations, has made providers reluctant to allow payload

analysis. Flow methodology provides an better solution to many privacy and legal issues rised

due to deep packet inspection.

2.2 Related Work

A number of techniques are used for accurate network application classification using port infor-mation, transport layer attributes, machine learning alogrithms or statistical models.

Figure-2.1 has shown the hierarchy of the various classification methods. We will briefly explain each

technique in the following sections.

2.2.1 Port-based Classification

IANA assigns port numbers for Internet applications for performing their operations. Port

based classification inspects the port numbers of the traffic to classify applications. Many applications deviate from standard port numbers to avoid detection and bypass firewall. Due

to rise of peer-to-peer applications, dynamic port allocation does provide accurate classification

(17)

Traffic Classification Methods

Port Payload Transport _layer Machine

Learning Graph Statistical

Unsupervised Learning

Supervised Learning

Figure 2.1: Various Network Traffic Classification Methods

For byte based accuracy, the accuray was observed to be no better than 70% . For flow

accuracy, the value of accuracy fell to 30-50% . Hence, port numbers has been less reliable due

to deviatation from conventional port numbers [14].

2.2.2 Payload-based Classification

Payload-based classification identifies charateristics bit strings in packet payload that depict

controlling signals for a particular application. This method are applied majorly for peer-to-peer application and intrusion detection traffic due to their variant nature. Signatures are

derived from frequent bit patterns, string position, fixed or variable offset and even behavioral

signatures [15].

For peer-to-peer applications-, the overall accuracy is as high as 97.39%. Moreover,

combi-nation of port and payload based classification increases accuracy to almost 79% [18]. Although

payload classification provide more reliable and accurate results than port based classification, it imposes processing and storage overhead constraints. The generation of application

signa-tures is a tedious process with intense string searching and pattern matching. Signasigna-tures needs

to updated frequently as new trends in the applications are evolved. Hence, payload classifica-tion does provide considerable accuracy but at the cost of processing, storage and computaclassifica-tion

(18)

2.2.3 Transport-layer based classification

Port-based had limitations of reliability and low accuracy whereas payload-based had large

processing and storage requirements. To overcome these limitations, transport-layer heuristics

identifies unique behavior patterns of applications during transferring and signalling phase. They utilize only transport layer header information for flow traffic. From Figure-2.2 reproduced

from [12], we can see the interaction between hosts at various social, functional and application

level defines heuristics for classification.

Figure 2.2: Application behavior patterns based on transport layer attributes

From [12], we can observe that there are more that 90% flows with high accuracy values of

95%. Futhermore, both accuracy and completeness exceeds 80%. It overcomes limitations of

(19)

not distinct applications such as eMule, BitTorrent. In addition, the behavior of hosts are more

prominent at the edge of the network and hence limiting its use in the backbone networks.

2.2.4 Machine Learning

Machine learning (ML) techniques has two phases- training and testing. In the training phase,

the heuristics is built for the classification model. The training dataset analyzes characterisitics

of flow datasets to build the classification model. The model is then subjected to test dataset for classification of applications.

Figure 2.3: Application behavior patterns based on transport layer attributes

Figure-2.3, reproduced from [16], provides us understanding of network traffic classification by ML-based model. The ML-based classification model working can be illustrated as follows

[16]:

1. Real Time Traffic Capture−Network traffic is mirrored to a particular port and then collected in a memory unit. The collection can take place at backbone of the network

(20)

some classification methods. For instance, classification based on transport layer attribute

determinng host behaviors become more effective at the edge of the network.

2. Flow Statistics Processing − The captured packets are converted to flows. Various statistics related to flows such as mean packet inter-arrival time, medialn packet length

or flow duration are calculated. NetFlow becomes a convenient tool for capturing and processing network traffic in such class model.

3. Data Sampling−There are large data sets captured for a particular session of captured traffic trace. Processing and memory power required for these operation is extremely high. Data sampling provides us means to narrow down searches in large datasets without losing

valuable information.

4. Training Process−A fraction of dataset is used for training the classification algorithm. The training datasets helps to understand characteristics of flow traffic and enhances

classification for testing dataset. Training data set continously updates the model in real

time to keep itself updated with changing behavior patterns.

5. Classification Model−Classification model consist of various clustering, data mining, and information heuristics techniques. Classification learning involes machine learning

from training phase to build a set of classification rules to classify unseen traffic. During training, behavior patterns are analyzed in various application classes. During testing,

the traffic flow matching such patterns and charateristics are classified accordingly.

6. Classified Output − Flows are labeled with application class according to the classi-fication model. There can be clustering of flow sets with similar characteristics. Any

association between feature can also be presented with a probabilistic estimatation.

There are two types of ML techniques − Supervised learning and Unsupervised learning. Supervised learning requires learning phase to map the ML classes to application. For this reason, this ML is more popular for application identification for a particular group of class.

Unsupervised learning relies on automatic discovery of classes through clustering of datasets.

The clusters needs to be labeled to a particular application through direct inspection by a human expert [16].

There are number of algorithms such as EM algorithms, Naive Bayes, k-NN used for machine

learning technique of network traffic classification. We observe high accuracy of 81.51 % for unsupervised classifier using Naive Bayes. Moreover, there was higher accuracy of 90.51%

achieved for AutoClass unsupervised learning technique. Due to use of flow statistics, storage

(21)

recognized due to unsupervised clustering techniques. But there is need of human intervention

for accurately mapping groups clusters to application class [6].

2.2.5 Graph

In [10], network traffic is analyzed in an innovative by looking at flows in form of Traffic

Disper-sion Graphs (TDG). The graph based peer-to-peer detection (Graphtion) causes partitioning

traffic flows into clusters depending on flow attributes. Figure-2.4, reproduced from [10], shows the graph-based interaction of various hosts for client-server for HTTP application.

Figure 2.4: Graph-based classification for HTTP-TG application

In this method, there is around 90% precision and recall for peer-to-peer detection. For a

(22)

of the method is that there is estimation of number of clusters required forK-Mean algorithm.

2.2.6 Statistical Classification

The basis of this method is that applications has unique statistical characteristics such as bytes, duration, packet interarrival times, flow duration. These attributes help distinguish different

applications by finding relationship between traffic class and its observed statistical properties

[16]. As noted in [19], the connection characteristics are analysed for constructing empherical models using bytes, duration, arrival periodicity for a number of TCP connection.

The statistical methods observe accuracy values of 80-90% for internet chat systems [4]. But there are limitation due to its dependency on link utilization and network performance. Critical

attributes as distribution of interarrival times are heavily dependent on link utilization and

(23)

Chapter 3

Classification Model

In this chapter, we are going to understand the classification model for achieving accurate flow classification by using clustering and associative rule mining techniques. The main motive is

to combine unsupervised learning techniques with associative rule mining to achieve finer and

more accurate flow classification model.

3.1 Classification Model

The main part of the algorithm consists of unsupervised learning techniques with associative

rule mining, to achieve finer accurate flow classification. The methodology that was followed

is depicted in Fig.3.1 and was determined such that its implementation is similar to the one developed in [16].

3.1.1 Processing and Flow Data Processing

Libtrace is used to process trace file in legacy ATM format to pcap format [5]. In order to

convert the pcap traces to flows, we used conversation option of tshark tool [23]. The output

flows are conversation between two hosts with IP, port, duration and packet size information. The flow output from tshark is parsed and stored in the database. The following parameters

attributes of the flow are stored in the database.

1. Source IP

2. Source Port

3. Destination IP

(24)

Processing & Flow Data

Transductive Classification

Clustering Association

Rule Classification

Classified Output Machine Learning

Classification Process

Figure 3.1: Classification Model

Table 3.1: Distribution of application class in traces

Application Class No. of Flows Flow Proportion Port Number

HTTP 24568 9.3% 80, 443

DNS 9284 3.5% 53

SMTP 2197 0.83% 25

MAIL 1464 0.55% 110

IRC 553 0.21% 113

5. Source Byte and Source Packets

6. Destination Byte and Destination Packets

7. Total Byte and Total Packets

8. Protocol

Trace Statistics and Methodology

Port-based classification is used for defining application class of a particular flow. Since the

traces are captured for year 2001, peer-to-peer applications were not prevalent leading to well

behaved network traffic with standard port numbers. We summarize traffic class distribution in our traces as follows.

As the classes are not evenly distributed, the training sets will not present accurate results

(25)

data set for the classification model. This allows a fair training for the classification model with

no redundancy and equal application class distribution.

3.1.2 Clustering

We implement unsupervised learning techniques for classification. Due to ever changing

appli-cation patterns, we choice clustering technique to discover natural clusters in the data using

internalized heuristics. They define cluster flows with similar patterns or properties according to evaluation metrics such as Euclidean distance, entropy or Bayesian information number.

There are two clustering methods evaluated in this work - K-Mean and Model-based clustering methods.

K-Mean Clustering

K-Mean Cluster method uses attributes of datasets to cluster objects in to k-partitions. The

sum of squared distances from each data point to cluster center is minimized iteratively. The

squared error function is given as,

E=

k

X

i=1

X

xj∈Si

|xj−µi |2

where there are kclustersSi, i = 1,2,...,k and µi is mean value of all the pointsxj ∈Si.

The algorithm randomly partitions dataset into k-clusters. For each k clusters, the mean

values is calculated. The intra-cluster variance or squared error function is calculated for each data with respect to all mean values. The clusters are recomputed in a way to reduce square

error value iteratively. The algorithm converges when data point no longer switch clusters.

In our work, R function kmean is used for k-means analysis. There are two arguments of the functions - number of clusters (k) and number of iteration (iter.max). After number of

iteration, we were able to empirically conclude the value of iteration to be 10. The value of k

was increased to understand the growth in accuracy, precision and recall of each application class.

K-Mean method is extremely fast and with quick convergence. But there is need to estimate

number of clusters (k). The algorithm is more effective for spherical clusters naturally present in the dataset. Also, due to random selection of clusters, the results will not be same for each

(26)

Model-based Clustering

Model-based clustering is used for grouping flow data and based on a strong framework to

derive knowledge from data. This clustering technique is built on an assumption that data were

generated by a particular model and tries to find the original model from data. With number data model recovered, the number of clusters and its characteristics can retrieved. In general

model-based clustering, the model parameter is selected by maximum likelihood criterion that

maximizes the log-likelihood of generating data. Model-based clustering is more flexible. The adaptive nature is advantageous for various distributions of data such as Bernoulli, Gaussian

with non-spherical variance, or member of different family [7].

In MCLUST package, Bayesian Information Criterion (BIC) provide comprehensive strate-gies for parameterization and estimation of number of clusters. Maximum Likelihood is used as

a criterion for estimating the model. The prediction of number of clusters is achieved by using Bayesian Information Criterion.

A normal or Gaussian mixture model is assumed,

n Y i=1 G X k=1

τkφk(xi|µk,Σk)

where,

x = data

G = number of components

φk(x|µk,Σk) = (2π)

−π

2 exp{−1

2(xi−µk)TΣ

−

k1(xi−µk)}

τk = probability that an observation belongs to the kth component(τk≥0;sumGk=1τk= 1) n = number of data points

Σ = covariance matrix

The classification likelihood with parameterized normal distribution assumed for each class:

n

Y

i=1

φli(xi|µli,Σli)

where li= labels indication a unique classification of each observation;li=k ifxi belongs

to thekth component.

Ellipsoidal distribution centered at means µk is observed for both component and cluster models. Geometric features are determined by covariances Σk. Each covariance matrix is

paratermized by eigen value decomposition in the form

(27)

Table 3.2: Parameterization of Covariance Matrix in Mclust Package

Identifier Model HC EM Distribution Volume Shape Orientation

E √ √ (univariate) equal

V √ √ (univariate) variable

EII λI √ √ Spherical equal equal NA

VII λkI

√ √

Spherical equal equal NA

EEI λA √ √ Spherical variable equal NA

VEI λkA

√

Diagonal equal equal coordinate axes

EVI λAk

√

Diagonal equal variable coordinate axes VVI λkAk

√

Diagonal variable variable coordinate axes

EEE λDADT √ √ Ellipsoidal equal equal equal

EEV λDkADT_k √ Ellipsoidal variable equal variable

VEV λkDkADT_k √ Ellipsoidal variable equal variable

VVV λkDkAkDT_k

√ √

Ellipsoidal variable variable variable

where,

Dk = orthogonal matrix of eigenvectors

Ak = diagonal matrix whose elements are proportional to the eigenvalues of Σk λk = scalar

In the model, principal component orientation for Σk depends on Dk. Also, the shape of

the density contours is determined byAk. Volume of the corresponding ellipsoid is specified by λkm which is proportional to λdk |Ak| where d represents data dimension. The characteristics

of the geometric features −orientation, volume and shape are estimated from data. These can vary between clusters, or restricted to the same for all clusters. In one dimension dataset, equal

variance and varying variance is represented by E and V respectively. For multiple dimensions, geometric characteristics of the model such as volume, shape and orientation is taken into

consideration. We can see from Table-3.2 that there are 9 models available in the Mclust.

The volume, shape and orientation represents clustering of flow data. In our case, VEV model represents volumes of all clusters as varying (V), shapes of all clusters as equal (E) and

orientation of all clusters as varying (V). BIC is an approximation to the Bayes factor. It adds

a penalty term on the log-likelihood depending on number of parameters.

BIC(k,Σ)≡2∗p(x|Θ, k,Σ)−v(k,Σ)∗log(n)

where, k is the number of clusters, p is the conditional probability; π stands for different variance-covariance structures andv(k,Σ) is the number of free parameters in a model with k

(28)

and model with the highest BIC is the best model for the flow dataset.

3.1.3 Transductive Classification

Due to clustering, we have grouped data sets with similar properties. These clusters need to

map to application class using a function or algorithm. Transductive classifier can be applied

to any clustering technique for classification. Consider train set is represented by Xm. There are l number of classes already labeled for the train set where i = 1,2,3, .. ,l. Now, Xn is test set in the model. We feed complete train and test set Xm+n to clustering method. The clustering methods partitions in to k-subsets. To find the class of a particular cluster, we need to find most common labels in that cluster according to train set labels. After clustering, the

transductive classifier algorithm can be built as follows,

• Label cluster with class iif the most common label in the cluster is i,i∈ 1, 2, .. l.

• Select uniformly at random any value of i for tie between two or more labels in a cluster or when no labels present.

The transductive classifier searches for labels from training datasets in the cluster. It labels application class corresponding to the most common label in the cluster. The remaining data

points have high probability of having similar properties as they are clustered together

depend-ing on algorithm heuristics. Thus, transductive classifier maps application class to clusters depending on train dataset labels.

3.1.4 Association-Rule Classification

Association Rule Classification unit provides a second level of heuristics to achieve finer clas-sification. Association rules find regularities between flow parameters with different measures

of interestingness for applications from transductive classifier output. Apriori Algorithm is

used for association rule learning [1]. The derived rules are traced back in the main trace and identified flows attain high accuracy, as will be shown in the next section. Moreover, the rule

association also helps predict IPs and ports used for servicing an application in the future. The

rules heuristics applied to flow data causes accurate classification and thus making classification method finer due to association rule mining techniques.

The association rule mining technique help we derive valuable relations within the data

points. In our case, such association will help us recognize pattern in a particular application. The goal of the data mining and classification models is to build a heuristic for predictive

recognizing the occurrence of application class from the data points.

(29)

an-tecedent. In other words, an event occurring due to antecedent is followed by the consequent

is depicted by the association rules for a particular class. The rules are represented as,

Antecedent => Consequent

Each rule is associated with three parameters, [1].

1. Support is the percentage of transactions that the rule can be applied to (the percentage

of transactions, in which it is correct).

2. Confidence is the number of cases in which the rule is correct relative to the number of cases in which it is applicable (and thus is equivalent to an estimate of the conditional

probability of the consequent of the rule given its antecedent).

3. Lift is the ratio of the probability that antecedent and consequent occur together, to the

multiple of the two individual probabilities for antecedent and consequent. There are

conditions where both support and confidence is high, and still result in to invalid rule. Therefore Lift indicates the strength of a rule over random occurrence of antecedent and

consequent, given their individual support. It provides information about improvement and increase in probability of consequent for a given antecedent. In the other words, Lift

is given as,

Rules with lower lift and confidence values are filtered out. Confidence is measured for

certainty of the rule. It measures event’s item set that matches antecedent of the implication

in the association rule and also matches consequent. The association rule indicates an affinity between antecedent and consequent with evaluation parameters such as Support, Confidence

and Lift.

Apriori Algorithm

Apriori Algorithm are based on the concept ofprefix tree. There are two steps for this algorithm

− frequent item sets and rule generation. In frequent item sets, the set of items maintaining least support are filtered. In the second phase, there will be rule generation from the computed

frequent item set. Apriori gains high efficiency by observing no superset of an infrequent item set.

Algorithm 1 has Apriori property stating that any subset of frequent item set must be

frequent. Join operation has a set of candidate k-item sets by joining Lk+1 with itself to find

Lk. Lastly, frequent item set are sets of item with minimum support. This represented by Li

(30)

Algorithm 1 Apriori Algorithm

Ck: Candidate item set of size k Lk: frequent item set of sizek Lk ={frequent item s}

for k= 1; Lk=φ;k++ do

Ck+1 = candidates generated fromLk

fortransaction tin database do

Increment count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates inCk+1 with minimum support

end forreturn ∪k Lk

(31)

Chapter 4

Analysis and Results

In this chapter, we evaluate our classification model for two clustering methods -K-Mean and Model-based clustering with second level of classification heuristics using Apriori algorithm.

We perform cluster analysis and application profiling for understanding behavior patterns with

probabilistic prediction for each classified flow.

4.1 Traces

Waikato Internet Traffic Storage (WITS) is an open community for collecting data traces for

traffic measurement and identification research. Auckland VI traces for 2001-06-11 are used for analysis and testing of classification model. TCP, UDP and ICMP traffic and packets are

zeroed for length more 64 bytes. Anonymised IP are having one to one mapping for IP in the

range 10.0.0.0/8. The direction of the traffic coming towards the University Network [2]. Community Resource for Archiving Wireless Data At Dartmouth (CRAWDAD) provide us

with wireless traces for research community. Packet headers from every wireless packet sniffed in four of the campus buildings. The buildings were among the most popular wireless locations

in 2001, and included libraries, dormitories, academic departments and social areas. The MAC

address were sanitized by randomizing bottom six hex digits. IP address are anomalized using prefix-preserving IP sanitizer as described in [25]. In this technique, the two IP addresses

sharing the same k-bit prefix result in to a sanitized address with same k-bit prefix [13].

4.2 Classification Metrics

The classification model is evaluated by the conventional machine learning metrics such as

Precision, Recall and Accuracy [6]. The data set from classification algorithms can be placed

(32)

Table 4.1: Evaluation Metrics for Machine Learning Classification Techniques

Classified as X X¯

X TP FN

¯

X FP TN

used as evaluation parameters for machine learning classification algorithms.

Table-4.1 explains relationship of True Positive, False Positive, False Negative and True

Negative. In rows, X is represented as positive presence and ¯X is represented as negative presence. In columns, X is represented as True value and ¯X is represented as False value.

1. True Positive(TP): Total percentage of members (in our case flows) classified as Class A

belong to Class A.

2. False Positive(FP): Total percentage of members of Class A but does not belong to Class A.

3. False Negative(FN): Total percentage of members of Class A incorrectly classified as not belonging to Class A.

4. True Negative(TN): Total percentage of members which do not belong to Class A are

classified not a part of Class A. It can also be given as (100% - FP).

These metrics will help us form the overall accuracy, precision and recall of the classification model. We define these terms as follows:

Precision is given as the ratio of True Positives to True and False Positives. This value helps us understand the classification capability to identify objects correctly.

P recision= T rueP ositive

(33)

Recallis given as the ratio of True Positives to True and False Negatives. This value helps us understand the classification capability to determine misclassified members are something else.

Recall= T rueP ositive

T rueP ositive+F alseN egative

Overall Accuracyis given as the ratio of all True Positives to all True and False Positives for all classes.

OverallAccuracy =

Pn

i=1(T rueP ositive)

Pn

i=1(T rueP ositive+F alseP ositive)

4.3 Analysis and Results

We have utilized two clustering algorithms- Kmean and Model-based clustering algorithms. For

Kmean clustering, k indicates the number of clusters and varied from 50 to 500 with a step

of 50. For Model-based classification, there is no need estimation of number of clusters in the dataset. Each algorithm clusters are passed through transductive classifier and finally precision,

recall and accuracy is calculated and plotted. The following section will have analysis of each

algorithm with their respective performance graphs.

4.3.1 K-Mean Clustering Precision Graph

The values of precision increases considerably as number of cluster increases as seen

fromFig-ure 4.1. Value of precision is in the vicinity of 60% for all applications for K<100. We observe that the value of precision reaches a high value but with a tradeoff of large number of clusters. For k = 500, the precision is in the vicinity of 92% but test data set is 1000 flows.

As the number of cluster size increases, the number of flows in each clusters become smaller.

For every iteration, the mean value for each cluster is tightly bound with other flows in the cluster as it eliminates false positive flows in each cluster.

Large cluster number will cause heavy computation stress on the classification as each cluster will assist in deriving application patterns. As precision is a measure of classifier ability to find

correct flows in a class, we would interpret that it performs fairly for increase in k.

Recall Graph

Recall gives us the measure of the rejection of flows from the other class. The flows from other

(34)

`

50 60 70 100 150 200 250 300 350 400 450 500

0 10 20 30 40 50 60 70 80 90 100 KMean Transductive Classifier Precision Graph

http mail smtp dns irc

Cluster Size (k) P re ci si on ( % )

Figure 4.1: K-Mean Transductive Classifier Precision Graph for WAND Traces

`

50 60 70 100 150 200 250 300 350 400 450 500

0 10 20 30 40 50 60 70 80 90 100 KMean Transductive Classifier Recall Graph

Cluster Size (k) R ec al l ( % )

(35)

for high values of k. We have observed recall in the vicinity of 94% for k>400 for all application class.

Accuracy Graph

`

50 60 70 100 150 200 250 300 350 400 450 500

0 10 20 30 40 50 60 70 80 90 100 KMean Transductive Classifier Accuracy Graph Cluster Size (k) A cc ur ac y (% )

Figure 4.3: K-Mean Transductive Classifier Accuracy Graph for WAND Traces

From Figure 4.3, the overall accuracy is given by the aggregate of precision for all the

application class. We observe high accuracy of 92% for k= 500 and low accuracy of 64% for k=50. As the have been growth of precision for all the classes, this has been well reflected with

high overall accuracy.

The growth in the precision and recall of the cluster is due to fine segmentation of the flow clusters due to increase in cluster number. As there is increase in number of clusters, the

K-Mean iteration for each k cluster are more tightly clustered by reducing within-cluster sum of squares. This results in to clusters of particular application class with unwanted flows from other class being eliminated. We can see very evidently with increase in precision for all the

application class.

(36)

application class observe different behavior, the value of k cannot simply be equal to number of

application class. Hence it is difficult to estimate k value for each new dataset as well as maintain high accuracy and precision. Moreover, large flow sets will have limitation of processing power

due to high iteration required for accurate results.

4.3.2 Model-based Clustering

Figure 4.4: Bayesian Information Criterion distribution for parameterization

Mclust uses Bayesian Information Criterion to estimate models and number of clusters.

From Fig.4.4, we can see that BIC value for each of the nine models is calculated. The model with the highest value will be chosen as the best estimate of the input data. In our case,

BIC is strongest for number of components equal to 9 corresponding to VEV model. VEV

model represents volumes of all clusters as varying (V), shapes of all clusters as equal (E) and orientation of all clusters as varying (V). BIC is an approximation to the Bayes factor. It adds

a penalty term on the log-likelihood depending on number of parameters. The best model

estimated is VEV model. VEV indicates volume and orientation is variable (V) and shape is equal (E) resulting intoEllipsoidal distribution. The number of clusters is very close to number

(37)

Flow sets have source and destination IP and port numbers and hence can be described as

multidimensional dataset. Hence , we can observe that geometric features come in to consid-eration for data model. From the model, we can estimate that flow points will be ellipsoidally

placed in space.

Precision Graph

`

0 10 20 30 40 50 60 70 80 90 100 Modelbased Transductive Classifier Precision Graph Application P re ci si on ( % )

Figure 4.5: Model-based Transductive Classifier Precision Graph for WAND Traces

By narrowing down a particular data model for test data, Mclust accurately estimates the

clustering parameters using BIC . VEV indicates volume and orientation is variable(V) and

shape is equal(E) with Ellipsoidal distribution.

Model-based hierarchical clustering is used for this particular ellipsoidal distributed data

model. Each flow behaves as an individual cluster and pairs are merged as it progresses up

the hierarchy. This is bottom-top approach as individual flows are considered as initialization point. Thus, flows with similar properties are clustered together as it moves up the hierarchy.

(38)

strong precision for 100% for HTTP, IRC and SMTP flows.

Recall Graph

`

0 10 20 30 40 50 60 70 80 90 100

Modelbased Transductive Classifier Recall Graph

Application

R

ec

al

l (

%

)

Figure 4.6: Model-based Transductive Classifier Recall Graph for WAND Traces

False positive rejection has been prominent for SMTP application. SMTP applications are used for mail relaying, user-client application and delivery only protocol. Hence, SMTP flows

are communicating with mail server on set of port numbers resulting into strong clustering.

This can be reinforced due to high precision value for SMTP. The recall is consistent for HTTP, DNS, IRC, Mail The rejection is strong of order of 80% for most the applications.

Overall Accuracy

Cluster Size Overall Accuracy

9.0 81.5

The overall accuracy is higher for this particular clustering classifier. As the number of

(39)

Table 4.2: Precision & Recall for TranductiveK-Mean and Model-based classification Application K-Mean Precision Model Precision K-Mean Recall Model Recall

http 79.58 100 65.63 81.5

mail 86.68 73.02 97.5 78.5

smtp 76.1 100 76.1 100

dns 76.12 55.28 84.67 78.5

irc 75.42 100 75.88 69

The package has features of combining model-based hierarchical clustering, expectation

and maximization for Gaussian mixture models with parameter estimation using BIC. The

probability density function is calculated for observation belonging to each class. A covariance matrix (i,j) is derived with probabilistic value of ith flow belonging to the jth class. Hence, due

to the data modeling and parameter estimation, the clustering algorithm converge to a finer

flow classification [7]. Moreover, the number of cluster is predicted by BIC and is also very close to the actual number of application class.

The gain in accuracy is because model-based clustering techniques assumes data belongs to a particular model. There is an accurate estimation of model leading enhanced clustering. As

opposed toK-Mean, there is no need of human intervention for predicting cluster size for each flow data.

4.3.3 Rule based Classification

Fig.4.7 shows accuracy for K-Mean clustering followed by Apriori Algorithm. There are weak association inK-Mean clusters because there is fall in accuracy for all application class.

SMTP observed lower accuracy for all the range of cluster size. SMTP have user-client application and messaging services running for a particular host. The messaging services varies

in destination port numbers. Moreover, relaying of services for mail services results in varied

host IPs. K-Mean clustering does not accurately converge to the mean values for this application class. One of the reason is that mean values are randomly selected for initialization of the

algorithm.

On the other hand, HTTP and DNS have gained high accuracy due to client-server protocol behavior. HTTP servers and DNS servers are set of host application that are constantly being

communicated by user hosts. These IP are clustered together due to few servers to many client communication patterns.

The goal of the second level of association rule classification is to identify flows and segregate

(40)

0 250 300 350 400 450 500 10

20 30 40 50 60 70 80 90 100

KMean and Rulebased Classification Accuracy Graph

Cluster Size

A

cc

ur

ac

y

(%

)

Figure 4.7: K-Mean and Rule-based Classification Accuracy Graph for WAND Traces

clusters map to flows with other application class indicating poor false positive rejection.

Fig.4.8 refers to model-based clustering with apriori algorithm for association rule

classifi-cation. Due to accurate model estimation capability, the nature of the cluster is predicted with Bayesian Information Criterion. The estimation leads to appropriate choice of distribution,

leading to an enhanced clustering result.

Association rule based on flow attributes identify accurately flows having similar application behavior. Due to strong data evaluation capability of Model-based clustering, the association

rule heuristics have high confidence value for application class.

We tested our classification model with other traces to nullify dependency of particular network or data traces. Crawdad data traces does observe accuracy on similar range as WAND

traces. WAND traces average accuracy comes up to 94% and Crawdad comes around 93%. As

(41)

`

http mail snmp dns irc

0 10 20 30 40 50 60 70 80 90 100

Modelbased and Rulebased Classification Accuracy Graph

WAND Crawdad

Application

A

cc

ura

cy

(%

)

Figure 4.8: Model-based clustering and Rule based classification Accuracy for WAND and Crawdad traces

Table 4.3: Accuracy for Association Rule classifier based onK-Mean and Model-based cluster-ing for WAND and CRAWDAD Traces

Application K-Mean Accuracy WAND Trace CRAWDAD Trace

http 61.48 100 100

mail 42.94 94.05 87.82

smtp 11.28 89.34 94.87

dns 61.67 92.11 88.15

(42)

Figure 4.9: Distribution of flow attributes in scatter plot matrix

4.4 Cluster Analysis

In Fig.4.9, the distribution of IP and ports for source and destination is depicted for

Model-based clustering method. The graph has diagonal blocks mentioned with the parameters in

that row and column. In order to consider source and destination IP distribution, we need to go at plot interesting source IP row/column and destination IP row/column.

• For HTTP (blue) application, large number of source IPs direct traffic too few destination IPs. HTTP behavior is prominent because a number of source hosts are communicating

with small number of destination hosts i.e. HTTP servers.

• In DNS application, source hosts communicate with a set of DNS servers. Each host interacts with different domains such as www, .org and .net for resolving IP addresses.

The DNS behavior is prominent because large group of source hosts communicate with

small repeated set of destination IPs.

• In IRC application, a group of hosts communicates with other group of hosts. There is roughly one-to-one correspondence between source and destination IP. This behavior is

prominent because IRC (cross) traffic has set of source IPs communicate with

(43)

• Moreover, source IP and destination port distribution show that destination port numbers are in the range of 30000 for SMTP (green) traffic. DNS destination ports are randomly distributed where as HTTP and MAIL have destination ports number in lower range of

1000.

4.5 Behavior Pattern for Application

The association rule mining technique help we derive valuable relations within the data points.

In our case, such association will help us recognize pattern in a particular application. The goal data mining and classification models is build a heuristics for predictive for recognizing

the occurrence of application class from the data points.

Rules with lower lift and confidence values are filtered out. Confidence is measured for certainty of the rule. It measures event’s item set that matches antecedent of the implication in

the association rule and also matches consequent. Table I clearly mentions flow parameters in

antecedent and consequent format for building association rules. The association rule indicates an affinity between antecedent and consequent with evaluation parameters such as Support,

Confidence and Lift.

For example, for HTTP, destination IP with index 1 and destination port number equal to 1273 will have a flow from source IP of index 233 with confidence (noted as C) of 1 and lift of 81.5. When we trace back this flow, we are accurately able to classify this flow as HTTP.

Moreover, in the future, if we observe any particular flow with this association then we would be certain to classify it as HTTP with confidence value of 1.

We can observe that each application are profiled according to their respective behavior.

Every rule confidence value tells us that flow with this particular association will be belonging to that corresponding application class. Thus, we understand the following for each application

behavior profiling: It is used for resolving host names with IP address. Domain Server services

queries for any particular host. Hence, we observe centralized architecture with client-server communication pattern. In Table-4.4, we can clearly observe that source and destination IPs

are associated with each other. There are host pairs (200,46), (187,32) and (19,69) representing

communication between host and domain server for DNS traffic. Various user messages are ser-viced by using mail application. There is a mail server servicing, routing and relaying messages

between two parties. The communication behavior of relaying and messaging is prominently

observed in Mail and SMTP application. In IRC application, different destination port number with set IPs communicating with each other is very well depicted.

Hence, we are able to gain higher accuracy by using apriori algorithm after transductive classifier. The design helps us find valuable association rules for classified data with estimation

(44)

Table 4.4: Association Rules from Apriori Association Algorithm

Traffic Rules Confidence Lift

DNS dip=16 =>dport=56322 1 21.8

dport=4207 => dip=43 1 17.7

dstip=19 =>srcip=69 0.02 35.5

srcip=69 =>dstip=19 0.02 35.5

srcip=200 => dstip=46 0.1 9.7

dstip=46 =>srcip=200 0.1 9.7

srcip=187 => dstip=32 0.2 3.5

dstip=32 =>srcip=187 0.2 3.5

Mail dip=8 =>dport=53 1 6.7

dip=16 =>dport=53 1 6.7

srcip=226 => dstip=49 1 53.75

dstip=49 =>srcip=226 1 53.75

dstip=8 =>dstport=53 1 6.7

srcip=200 => dstip=51 1 4.57

dstip=51 =>srcip=200 1 4.57

srcip=95 =>dstip=30 0.4 8.95

dstip=13 =>dstport=53 0.4 6.71

HTTP dip=1 & dport=1273 =>srcip=233 1 81.5

{} =>dip=1 0.7 1

srcip=139 => dstip=1 1 1.4

srcip=9 =>dstip=1 1 1.4

srcip=41 =>dstip=1 1 1.4

dstport=2006 =>dstip=1 1 1.4

dstip=41 =>srcip=164 40.75

srcip=233 & dstip=1 =>dstport=1273 1 81.5

IRC sip=162 =>dport=37273 1 69

{} =>dip=16 0.7 1

dstip=3 & dstport=39938 =>srcip=18 1 46 srcip=168 & dstport=37321 =>dstip=3 1 4.4 srcip=168 & dstip=3 =>dstport=37321 1 69 dstip=3 & dstport=37321 =>srcip=168 1 69 srcip=162 & dstport=37273 =>dstip=16 1 1.3 srcip=162 & dstip=16 =>dstport=37273 1 69 SMTP sip=262 & dport=4868 => dip=28 1 2.8

dip=28 & dport=4868 =>sip=262 1 15.3

srcip=196 => dstip=3 1 5.7

dstport=4706 =>srcip=261 1 22.2

dstport=4706 =>dstip=28 1 2.8

srcip=277 => dstip=43 1 7.1

srcip=241 => dstip=3 1 5.7

(45)

Chapter 5

Conclusion

5.1 Summary of Results

In this work, we have presented a classification model that achieves high flow classification accuracy with application behavior profiling. Association rules derived from transductive

clas-sification are fed back to refine clasclas-sification model. This leads to a finer clasclas-sification model

with much better accuracy. The rule heuristics are produced without any human intervention and become independent of dataset. In addition, our model is able to detect new behavior

patterns for next generation applications. We summarize our results in the following points:

• We built two level classification models with unsupervised learning techniques. Clustering transductive classification was used for first level and association rule mining for second

level of classification. Model-based clustering and Apriori Algorithm form basis of the

classification model.

• We achieve accuracy value of around 94%. The precision value for all application class remains high, and independent of trace sets.

• Each application has behavior profiling due to use of Apriori Algorithm. The behavior patterns tells us various flow attributes so that we can identify and classify flows in the

future.

• Each association rule is providing with confidence value to gain accurate results. More-over, support provides information of the number of flows present in the dataset. In addition, Lift defines strength of the relation between antecedent and consequent.

• Due to utilization of flow parameters, the processing and computational power is less. Large flow data sets can be analyzed with accurate results and application behavior

(46)

5.2 Future Work

In the future work, we are planning to capture trace files for reviewing our classification model.

We plan to analyze wide variety of applications with different behavior patterns and connection norms. With accurate classification, we want to build a prediction model for application class.

This will enhance the capability of classifying as well as monitoring of particular traffic in the

(47)

REFERENCES

[1] C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. In Comp-stat: Proceedings in Computational Statistics: 15th Symposium Held in Berlin, Germany, 2002, page 395. Physica Verlag, 2002.

[2] WAND Trace Catalogue. http://www.wand.net.nz/wits/catalogue.php.

[3] IOS Cisco. Netflow white papers, 2006.

[4] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of Internet chat systems. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51– 64. ACM, 2003.

[5] Libtrace Documentation. http://www.wand.net.nz/trac/libtrace.

[6] J. Erman, A. Mahanti, and M. Arlitt. Internet traffic identification using machine learning. InProceedings of IEEE GlobeCom. Citeseer, 2006.

[7] C. Fraley and A.E. Raftery. MCLUST version 3 for R: Normal mixture modeling and model-based clustering. Technical report, Citeseer, 2006.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang. ACAS: automated construction of ap-plication signatures. In Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, page 202. ACM, 2005.

[9] IANA. Internet Assigned Numbers Authority (IANA).

http://www.iana.org/assignments/port-numbers.

[10] M. Iliofotou, H. Kim, M. Faloutsos, M. Mitzenmacher, P. Pappu, and G. Varghese. Graph-based P2P traffic classification at the internet backbone. InIEEE Global Internet Sympo-sium. Citeseer, 2009.

[11] T. Karagiannis, A. Broido, and M. Faloutsos. Transport layer identification of P2P traffic.

In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages

121–134. ACM, 2004.

[12] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: multilevel traffic classification in the dark. ACM SIGCOMM Computer Communication Review, 35(4):240, 2005.

[13] David Kotz, Tristan Henderson, Ilya Abyzov, and Jihwang Yeo. CRAW-DAD trace dartmouth/campus/tcpdump/fall01 (v. 2004-11-09). Downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall01, November 2004.

(48)

[15] A.W. Moore and K. Papagiannaki. Toward the accurate identification of network applica-tions. Passive and Active Network Measurement, pages 41–54, 2005.

[16] TTT Nguyen and G. A rmitage. A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4):56–76, 2008.

[17] I. Papapanagiotou and M. Devetsikiotis. Aggregation Design Methodologies for Triple Play Services. InIEEE CCNC 2010, Las Vegas, USA, pages 1–5, 2010.

[18] B.C. Park, Y.J. Won, M.S. Kim, and J.W. Hong. Towards automated application signa-ture generation for traffic identification. In IEEE Network Operations and Management Symposium, 2008. NOMS 2008, pages 160–167, 2008.

[19] V. Paxson. Empirically derived analytic models of wide-area TCP connections.IEEE/ACM Transactions on Networking (TON), 2(4):336, 1994.

[20] V. Paxson. Bro: A system for detecting network intruders in real-time.Comput. Networks, 31(23):2435–2463, 1999.

[21] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 135–148. ACM, 2004.

[22] S. Sen, O. Spatscheck, and D. Wang. Accurate, scalable in-network identification of p2p traffic using application signatures. InProceedings of the 13th international conference on World Wide Web, page 521. ACM, 2004.

[23] tshark Documentation. http://www.wireshark.org/docs/man-pages/tshark.html.

[24] J. Wang, Z. Ge, H. Jiang, S. Jin, and A.W. Moore. Lightweight application classification for network management, July 2008.

(49)

(50)

Appendix A

Application Behavior Profiling

A.1 DNS behavior Pattern

no rules support confidence lift

1 srcip=180 => dstip=6 0.010 1 71

(51)

A.2 Mail behavior Pattern

No rules support confidence lift

(52)

A.3 HTTP behavior Pattern

1 {} =>dstip=1 0.711 0.711 1

2 srcip=139 =>dstip=1 0.012 1 1.40

3 srcip=9 =>dstip=1 0.012 1 1.405

4 srcip=41 =>dstip=1 0.012 1 1.405

5 dstport=2006 => dstip=1 0.012 1 1.405

6 dstip=41 =>srcip=164 0.01 1 40.75

7 dstip=55 =>srcip=164 0.012 1 40.75

8 srcip=11 =>dstip=1 0.012 1 1.405

9 dstip=53 =>srcip=242 0.012 1 81.5

10 srcip=242 =>dstip=53 0.012 1 81.5

11 srcip=6 =>dstip=1 0.012 1 1.405

12 srcip=134 =>dstip=2 0.01 1 54.3

13 srcip=439 =>dstip=1 0.012 1 1.405

14 srcip=407 =>dstip=1 0.0127 1 1.405

15 dstport=1273 =>srcip=233 0.012 1 81.5 16 srcip=233 => dstport=1273 0.012 1 81.5

17 dstport=1273 => dstip=1 0.012 1 1.405

18 srcip=424 =>dstip=1 0.012 1 1.405

19 srcip=233 =>dstip=1 0.012 1 1.405

20 srcip=409 =>dstip=1 0.0122 1 1.405

21 srcip=2 =>dstip=1 0.018 1 1.405

22 srcip=50 =>dstip=1 0.018 1 1.405

23 srcip=190 =>dstip=1 0.018 1 1.405

24 srcip=226 =>dstip=49 0.01 1 13.58

25 srcip=389 =>dstip=1 0.0181 1 1.405

26 srcip=382 =>dstip=1 0.011 1 1.405

27 srcip=408 =>dstip=1 0.024 1 1.405

28 srcip=338 =>dstip=1 0.024 1 1.405

29 srcip=75 =>dstip=45 0.024 1 32.6

30 dstip=45 =>srcip=75 0.024 0.8 32.6

31 srcip=344 =>dstip=1 0.024 1 1.405

32 srcip=444 =>dstip=1 0.030 1 1.405

33 srcip=155 =>dstip=1 0.024 0.8 1.124

(53)

A.4 IRC behavior Pattern

1 {} =>dstip=16 0.71 0.71 1

2 srcip=301 =>dstip=16 0.014 1 1.39

3 srcip=247 =>dstip=16 0.014 1 1.39

4 srcip=49 =>dstip=16 0.014 1 1.39

5 srcip=457 =>dstip=16 0.014 1 1.39

6 dstport=39938 => srcip=18 0.014 1 46

7 dstport=39938 =>dstip=3 0.0144 1 4.41

8 srcip=99 =>dstip=16 0.014 1 1.39

9 srcip=168 =>dstport=37321 0.014 1 69

10 dstport=37321 =>srcip=168 0.014 1 69

11 srcip=168 =>dstip=3 0.014 1 4.45

12 dstport=37321 =>dstip=3 0.014 1 4.45

13 srcip=162 =>dstport=37273 0.014 1 69

14 dstport=37273 =>srcip=162 0.014 1 69

15 srcip=162 =>dstip=16 0.014 1 1.39

16 dstport=37273 => dstip=16 0.01 1 1.39

17 srcip=292 =>dstip=3 0.014 1 4.45

18 srcip=428 =>dstip=16 0.014 1 1.39

19 srcip=316 =>dstip=16 0.014 1 1.39

20 srcip=209 =>dstip=16 0.014] 1 1.39

21 srcip=183 =>dstip=16 0.021 1 1.393

22 srcip=18 =>dstip=3 0.021 1 4.45

23 srcip=322 =>dstip=16 0.028 1 1.39

24 srcip=305 =>dstip=16 0.028 1 1.39