New Distributed Framework for Cyber Attack Detection and Classification

A NEW DISTRIBUTED FRAMEWORK FOR CYBER ATTACK DETECTION AND CLASSIFICATION

By

SANDEEP GUTTA

Bachelor of Engineering in Electronics and Communication Engineering

Andhra University

Visakhapatnam, Andhra Pradesh, India 2008

Submitted to the Faculty of the Graduate College of Oklahoma State University

in partial fulfillment of the requirements for

the Degree of MASTER OF SCIENCE

COPYRIGHT © By

SANDEEP GUTTA December, 2011

A NEW DISTRIBUTED FRAMEWORK FOR CYBER ATTACK DETECTION AND CLASSIFICATION

Thesis Approved:

Dr. Qi Cheng, Thesis Advisor

Dr. Martin T. Hagan

Dr. Weihua Sheng

Dr. Sheryl A. Tucker, Dean of the Graduate College

TABLE OF CONTENTS

Chapter

1 INTRODUCTION
    1.1 Taxonomy of Cyber Attacks
    1.2 Motivation
    1.3 Thesis Organization
2 CYBER ATTACK DETECTION SYSTEMS
    2.1 Proposed Cyber Attack Detection System
3 SUPPORT VECTOR MACHINES
    3.1 Margins and Maximum Margin Classifiers
    3.2 Nonlinear Feature Space Mapping and Kernel Trick
    3.3 Regularization and Soft Margins
    3.4 Solving the SVM Dual Problem
    3.5 Summary
4 FAST AND EFFICIENT SVM TRAINING APPROACHES
    4.1 Existing Fast SVM Training Approaches
    4.2 A New Fast SVM Training Approach
        4.2.1 Training Data Clustering
        4.2.2 Training Data Elimination
    4.3 Complexity and Performance Analysis
    5.1 Basics of Dempster-Shafer Theory
    5.2 Dempster’s Rule of Combination
    5.3 Local Fusion Rule at Each Sensor
    5.4 Global Fusion Rule at the Fusion Center
6 EXPERIMENTAL RESULTS
    6.1 Cyber Attacks Dataset
    6.2 Training Phase
    6.3 Decision Making Phase
7 CONCLUSIONS
BIBLIOGRAPHY

LIST OF TABLES

Table

6.1 Types of cyber attacks in the 1999 KDD intrusion detection dataset.
6.2 Total training time for the support vector machines.
6.3 Confusion matrix of the proposed cyber attack detection system on the NSL-KDD test set (with only old attacks).
6.4 Confusion matrix of the proposed cyber attack detection system (using SVMs which are trained normally) on the NSL-KDD test set (with only old attacks).
6.5 Confusion matrix of the proposed cyber attack detection system on the

LIST OF FIGURES

Figure

2.1 Proposed distributed cyber attack detection system.
2.2 Local decision at each sensor in the system.
3.1 Geometric margins.
3.2 Sensitivity of the original SVM (with hard margin) to outliers.
4.1 Cluster defined as the convex hull of the data points.
4.2 Löwner-John ellipsoid.
4.3 Logistic sigmoid curve of the training sample elimination model.
5.1 Basic probability assignments (BPAs) of two independent sources of evidence.
5.2 Dempster’s rule of combination of basic probability assignments (BPAs) of two independent sources of evidence.
6.1 The 1998 DARPA intrusion detection evaluation testbed.

This work has been supported in part by the Center for Telecommunications and Network Security (CTANS).

CHAPTER 1

INTRODUCTION

With the enormous growth of cyber activity and Internet penetration, the number of cyber attacks has increased significantly over the last decade. Detecting these attacks is a major concern for governments and organizations all over the world. Recent well-known cyber attacks include the Nimda attack, the SQL Slammer attack, the July 2009 attacks, and Operation Aurora. The Operation Aurora cyber attacks were launched against major organizations such as Google, Yahoo, Adobe Systems, Morgan Stanley, and Dow Chemical Company in the second half of 2009. In April 2011, a series of cyber attacks was launched against Sony’s PlayStation Network, which took the network offline for about 24 days. In May 2011, cyber attacks were launched against Citibank, and the account information of about 1% of its 21 million North American credit card customers was stolen. Such cyber attacks impose great expense and disruption on the affected organizations. In April 2009, the Pentagon announced that more than $100 million had been spent over the preceding six months on repairing the damage from cyber attacks. In May 2011, Sony Corporation declared that the cost of its PlayStation Network attack was about $171 million.

Moreover, the computing systems and networks in critical sectors such as the military, banking and finance, telecommunications, transportation, and medicine are vulnerable to these growing cyber threats. Military and financial organizations are among the top targets for attackers. Protecting sensitive data in these critical sectors from cyber attacks is now a top priority for governments and other organizations. Politically motivated cyber attacks are also increasing in number. The Economist describes cyberwarfare as the fifth domain of warfare after land, sea, air, and space. In May 2010, the U.S. government established a new agency, the United States Cyber Command (USCYBERCOM), exclusively to protect and defend its military networks. Other countries have set up similar agencies to protect their digital infrastructure.

1.1 Taxonomy of Cyber Attacks

A cyber attack (or intrusion) can be defined as a series of malicious computer activities that threaten and compromise the security and integrity of a computer/network system. Cyber attacks disrupt the normal operation of a computer system, and may illegally access or destroy the information in it. Most of the time, a cyber attack is launched through the data stream on a computer network. Classifying cyber attacks helps in learning the behavior of different attacks, which can in turn be used in the design of cyber attack detection systems. In general, cyber attacks can be broadly classified into the following four categories:

1. Denial of Service (DoS) Attacks: DoS attacks make a computer resource (e.g., a Web server) unavailable to its legitimate users. The most common form of DoS attack keeps the resource busy with a flood of unnecessary requests, so that legitimate users cannot use it. There are many variants of DoS attacks, including TCP-SYN Flood, ICMP/UDP Flood, Smurf, Ping of Death, Teardrop, Mailbomb, and Apache2.

DoS attacks are the most common cyber attacks. The most popular variant is the Distributed Denial of Service (DDoS) attack. As the name suggests, DDoS attacks are launched in a distributed fashion. Most often, the victim’s computing resource is more powerful than the attacker’s. In such cases, the attacker launches the attack through multiple intermediate hosts (generally called bots).

2. Remote to Local (R2L) Attacks: In this type of attack, an attacker tries to illegally gain local access to a computer system by sending packets to the system over a network. Common ways of doing this include guessing passwords through the brute-force dictionary method, FTP Write, etc.

3. User to Root (U2R) Attacks: In this class of attacks, an attacker with normal user privileges illegally tries to gain root access (administrator privileges) to a computer system. One common way of doing this is to use buffer overflow methods.

4. Probing Attacks: In this type of attack, an attacker scans a network/computer to find vulnerabilities through which the system can be exploited. This amounts to surveillance of the system. One common way of doing this is port scanning. By scanning the different ports of a computer system, the attacker can learn which ports are open, which services are running, which hosts in the network are up, and various other sensitive details such as IP addresses, MAC addresses, and firewall rules. Some examples of probing attacks are IPSweep, NMap, MScan, Satan, and SAINT.
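The four categories above can be captured in a small lookup table. The sketch below is illustrative only: the names `ATTACK_TAXONOMY` and `category_of` are hypothetical, and the entries are limited to the examples named in the list.

```python
# Illustrative lookup table built only from the examples named in the list above.
ATTACK_TAXONOMY = {
    "DoS": ["TCP-SYN Flood", "ICMP/UDP Flood", "Smurf", "Ping of Death",
            "Teardrop", "Mailbomb", "Apache2"],
    "R2L": ["Password guessing", "FTP Write"],
    "U2R": ["Buffer overflow"],
    "Probing": ["IPSweep", "NMap", "MScan", "Satan", "SAINT"],
}


def category_of(attack_name):
    """Return the category of a named attack, or None if it is not listed."""
    wanted = attack_name.lower()
    for category, attacks in ATTACK_TAXONOMY.items():
        if any(attack.lower() == wanted for attack in attacks):
            return category
    return None
```

A detection system with L classes would use these four categories plus the normal class as its labels.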

There are other attacks, such as Man-in-the-middle (MITM) attacks and Social Engineering attacks. However, these attacks can largely be prevented by following personal security measures. For example, an MITM attack is a form of active eavesdropping in which the attacker intercepts the conversation between the victims. This is generally done by establishing separate and independent connections with each victim and relaying messages between them without their knowledge. The MITM attack can be prevented by using strong encryption schemes such as the Secure Shell (SSH) protocol, which prevents the attacker from reading the messages.

1.2 Motivation

Cyber attacks have become a major threat. The cost of the damage from a cyber attack can be very high, so protecting computer networks from these attacks has become a top priority and the problem of cyber attack detection is of great importance. There is an urgent need for an efficient cyber attack detection system that can accurately detect cyber attacks in time for proper countermeasures to be taken to protect the vital cyber infrastructure. Most current cyber attack detection systems suffer from two main issues:

1. High computational complexity

2. Low detection accuracy

There is generally a trade-off between these two: a cyber attack detection system with high detection accuracy usually suffers from high computational complexity. One of the main challenges in designing a cyber attack detection system is to decrease the overall computational complexity of the system without decreasing its detection accuracy. This thesis addresses that challenge by proposing a new cyber attack detection system that has relatively low computational complexity and high detection accuracy.

1.3 Thesis Organization

In this chapter, the different types of cyber attacks were introduced and their behavior was described in detail.

In Chapter 2, different cyber attack detection systems are introduced. The relative merits and demerits of those cyber attack detection systems are discussed. Then, the proposed distributed cyber attack detection system is described in detail.

In Chapter 3, the support vector machines, which are used as the binary classifiers in the proposed cyber attack detection system, are presented in detail. Support vector machines are widely regarded as among the best supervised classifiers in terms of classification performance.

In Chapter 4, fast approaches to training the support vector machine are discussed. Although support vector machines are strong classifiers, their training procedure has high computational complexity. This often limits their use in real-world applications involving huge amounts of data (e.g., cyber attack detection). In this chapter, several existing fast training approaches for support vector machines are presented. Finally, a new fast and efficient training approach is proposed, which reduces the training complexity without significantly degrading the classification performance of the support vector machine.

In Chapter 5, the fusion rules used in the proposed distributed cyber attack detection system are discussed. Effective fusion rules are proposed using the Dempster-Shafer theory of evidence.

In Chapter 6, the experimental results are provided. Conclusions are provided in Chapter 7.

CHAPTER 2

CYBER ATTACK DETECTION SYSTEMS

The cyber attack detection system, also referred to as the intrusion detection system (IDS), continuously monitors a computer/network system and tries to identify cyber attacks while they are in progress. Once an attack is detected, the cyber attack detection system alerts the responsible security professional, who then takes the necessary action. The design of a cyber attack detection system is generally based on the assumption that cyber attack activities differ from normal activities and hence can be detected [1].

Generally the cyber attack detection systems or intrusion detection systems (IDSs) are classified into two main categories:

1. Signature-based intrusion detection systems

2. Anomaly-based intrusion detection systems

The signature-based intrusion detection system relies on prior knowledge of known attack signatures. In this approach, an activity is classified as an attack if it matches an already known attack signature. The performance of this approach is therefore limited by the available signature database. It cannot detect novel attacks that differ substantially from the known attacks. Moreover, defining signatures for all known attacks is in general difficult, and creating them requires expert knowledge. Any error in the signature definitions may lead to a large number of missed detections. The common techniques adopted include pattern matching and rule-based techniques. Well-known examples of signature-based intrusion detection systems are Snort [2], Bro [3], and Prelude [4].

The anomaly-based intrusion detection system assumes that a cyber attack will always show some deviation from normal patterns. First, a normal profile of the system is developed. Then, all the activities that do not match this normal profile are considered to be attacks. The major drawback of this approach is the difficulty of accurately modeling the normal behavior of the system, which can be highly dynamic. The assumption that abnormal behavior is due only to attacks is another limitation. To summarize, although the anomaly-based intrusion detection system may sometimes detect novel attacks, accurately profiling the normal behavior of a system is in general difficult, and an inaccurate profile may lead to a very large number of false alarms.
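The contrast between the two approaches can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: signatures are treated as plain substrings, the normal profile is the mean and standard deviation of a single made-up feature (connections per second), and the threshold of three standard deviations is arbitrary. Real systems such as Snort use far richer rule languages.

```python
from statistics import mean, stdev


def signature_detect(payload, signatures):
    """Signature-based: flag the activity if it matches any known attack pattern."""
    return any(sig in payload for sig in signatures)


def build_normal_profile(normal_values):
    """Anomaly-based, step 1: summarize normal behavior (mean and std. dev.)."""
    return mean(normal_values), stdev(normal_values)


def anomaly_detect(value, profile, k=3.0):
    """Anomaly-based, step 2: flag anything more than k std. devs. from normal."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


signatures = ["/etc/passwd", "' OR '1'='1"]           # made-up signature database
profile = build_normal_profile([10, 12, 11, 9, 13])   # made-up connections/second

known_attack = signature_detect("GET /etc/passwd HTTP/1.0", signatures)
novel_attack = signature_detect("GET /new-exploit HTTP/1.0", signatures)
burst = anomaly_detect(50, profile)
```

In this sketch the signature detector misses the request that is not in its database, while the anomaly detector flags any large deviation, whether or not it is actually an attack. These are the two limitations described above.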

Both of the conventional intrusion detection approaches described above rely entirely on explicit pattern matching. That is, all the activities matching the known attack signatures are considered to be attacks in the signature-based approach, and all the activities not matching the system’s normal profile are considered to be attacks in the anomaly-based approach. These approaches perform well when the attacks are similar and regular. In general, however, attacks are not regular: novel attacks are developed every day, and the conventional detection approaches cannot adapt to them. Therefore, there is a need for an adaptive intrusion detection approach. These difficulties led researchers to apply pattern classification techniques to the problem of intrusion detection, where the normal and attack models, with appropriate decision rules, are learned automatically by the system. Such systems can adapt to new, unforeseen attacks. The main advantage of pattern classification techniques is their ability to generalize, learn, and adapt.

The intrusion detection systems (IDSs) can also be classified into the following categories based on the type of data they collect:

1. Host-based intrusion detection systems: These IDSs monitor the individual host systems in a network. They often collect a number of system-level details such as system calls, application logs, incoming and outgoing network events, and system file changes. An example of a host-based intrusion detection system is OSSEC [5].

2. Network-based intrusion detection systems: A network-based IDS monitors the whole network for signs of an ongoing attack. It examines the overall network traffic and observes different hosts as well. Network-based IDSs access the network traffic through a network tap, or by connecting to a network switch configured for port mirroring. Examples of open-source network-based IDSs are the well-known Snort [2] and Bro [3]. An example of a commercial network-based IDS is the McAfee Network Security Platform [6].

3. Application-based intrusion detection systems: These IDSs monitor only a specific application by collecting the necessary data. They use external sensors that capture the data exchanged between the monitored application and the third-party entities with which the application interacts.

4. Hybrid intrusion detection systems: These IDSs are a combination of two or more of the IDSs described above.

2.1 Proposed Cyber Attack Detection System

In this thesis, a new network-based cyber attack detection system (intrusion detection system) is proposed. The proposed system is designed in a distributed fashion, using multiple sources of information (sensors). Sensing technology has improved greatly in recent times, and a wide variety of advanced sensors are now available that can collect a lot of information from the network. Each sensor is a source of information that operates independently and collects a different type of data from the network. Using a variety of sensors, we can obtain a more complete view of the network, and hence cyber attacks can be detected more accurately.

The data collected from all the sensors in the system can be processed in two ways: the centralized approach and the decentralized (or distributed) approach. In the centralized approach, each sensor transmits its entire data to a central unit. The central unit receives the data from all the sensors and processes it to generate a final decision. This traditional centralized approach has high computational complexity, requires huge bandwidth, and is inefficient in practice. The distributed approach, on the other hand, is more efficient and has relatively low computational complexity. In this approach, each sensor first processes its own data and generates a local decision. The local decisions from all the sensors are then transmitted to the central unit (generally referred to as the fusion center), which generates the final decision based on all the available local decisions. Because only decisions are transmitted rather than raw data, this approach requires much less bandwidth. Moreover, with the recent advances in sensing, currently available sensors have computing capabilities of their own. Thus we propose a new distributed cyber attack detection system that is more robust and efficient than the traditional ones.

Let us assume there are $L - 1$ known types of cyber attacks. Thus there are a total of $L$ classes including the normal class. Let there be $M$ sensors, distributed across the network and observing different aspects of the network under consideration. Each sensor processes the observed data and makes a local decision regarding the network condition. The local decision of the sensor $S_j$ is $u_j$. All these local decisions are then transmitted to the fusion center, which generates the final decision about the state of the network using the available local decisions. The proposed distributed cyber attack detection system is shown in Figure 2.1.

Figure 2.1: Proposed distributed cyber attack detection system.
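The data flow of the distributed approach can be sketched as follows. This is a minimal illustration, not the proposed system: the function names are hypothetical, the local classifiers are made-up threshold rules, and a plurality vote stands in for the fusion rule, which in this thesis is based on the Dempster-Shafer theory of evidence.

```python
from collections import Counter


def run_sensors(observations, local_classifiers):
    """Each sensor S_j turns its own observation into a local decision u_j."""
    return [clf(obs) for clf, obs in zip(local_classifiers, observations)]


def fuse(local_decisions):
    """Stand-in fusion rule: plurality vote over u_1, ..., u_M."""
    return Counter(local_decisions).most_common(1)[0][0]


def values_sent(observations, distributed):
    """Count the values transmitted to the central unit under each approach."""
    if distributed:
        return len(observations)                   # one decision per sensor
    return sum(len(obs) for obs in observations)   # every raw feature value


# Made-up example with M = 3 sensors, each observing three feature values.
observations = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.95, 0.9, 0.85]]
classifiers = [lambda obs: "attack" if max(obs) > 0.5 else "normal"] * 3

u = run_sensors(observations, classifiers)
final_decision = fuse(u)
```

With these inputs the sensors send three local decisions to the fusion center, whereas the centralized approach would send all nine raw values.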

At each sensor, there are $L$ binary classifiers, with each classifier distinguishing one class from the rest. Let $g_{ij}(\cdot)$ denote the binary classifier $i$ at the sensor $j$. The binary classifier $g_{ij}(\cdot)$ classifies the observed data record as either “belonging to class $i$” or “not belonging to class $i$.” The proposed local fusion rule at each sensor generates the corresponding local decision based on the outputs of the $L$ binary classifiers. This is shown in Figure 2.2.
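The one-against-the-rest arrangement at a sensor can be sketched as below. The helper names are hypothetical, and picking the classifier with the largest output is only a stand-in for the local fusion rule, which in this thesis is based on the Dempster-Shafer theory of evidence.

```python
def one_vs_rest_labels(labels, target_class):
    """Relabel a multiclass label list for the binary classifier of one class."""
    return [+1 if y == target_class else -1 for y in labels]


def local_decision(scores):
    """Stand-in local rule: pick the class whose binary classifier scores highest.

    scores[i] is the real-valued output of binary classifier i for one record.
    """
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up labels for four training records.
labels = ["normal", "dos", "probe", "dos"]
dos_targets = one_vs_rest_labels(labels, "dos")

# Made-up outputs of L = 3 binary classifiers for one observed record.
decision = local_decision([-0.4, 1.2, -0.9])
```

Each of the L binary classifiers is trained on the same records with its own relabeling, so the training cost at a sensor grows with the number of classes.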

The binary classifiers in the proposed system are designed using the machine learning (statistical learning) approach. In general, we assume that we initially have data (generally referred to as the training data) of different types of cyber attacks available. All the binary classifiers are initially trained on this available training data. Gradually over time, when the data about the new types of cyber attacks becomes available, the training data is updated and the classifiers are retrained using the updated training data. In order to decrease the computational complexity of the classifier training process, a new fast and efficient training method is proposed in this thesis. Effective fusion rules, at the sensors and the fusion center, are proposed using the Dempster-Shafer theory of evidence.

Figure 2.2: Local decision at each sensor in the system.

CHAPTER 3

SUPPORT VECTOR MACHINES

In this chapter, we describe in detail the support vector machine (SVM), which is used as the binary classifier in our proposed cyber attack detection system. The SVM is widely considered to be one of the best off-the-shelf supervised classifiers. The SVM seeks a decision boundary (hyperplane) between the two classes that lies at the maximum distance from both classes. The main advantage of SVMs is that the parameters are found by solving a convex optimization problem, so that any local solution is also globally optimal.

The SVM is a sparse kernel method. Kernel methods are a special class of pattern classification techniques in which the decisions (or predictions) are made using the entire training data, or a subset of it. Because the SVM solution is sparse, its decisions (or predictions) are made using only a few training data points.

The problem of supervised binary classification can be formulated as follows. The training set $T = \{(\mathbf{x}_i, y_i);\ i = 1, \ldots, n\}$ is given. The training sample $\mathbf{x}_i$ has $d$ features, i.e., $\mathbf{x}_i \in \mathbb{R}^d$. Thus each training sample $\mathbf{x}_i$ is a point in $\mathbb{R}^d$. The label of the sample $\mathbf{x}_i$ is $y_i \in \{-1, +1\}$, corresponding to two different classes. The goal of SVM classifier design is to learn a function $y = f(\mathbf{x})$ that not only classifies the training data accurately but also generalizes well to new samples (samples with unknown labels). Mathematically, the SVM classifier can be defined as

$$ y = f(\mathbf{x}) = \operatorname{sgn}\left(\mathbf{w}^T \Phi(\mathbf{x}) + b\right), \tag{3.1} $$

where $\mathbf{w}$ and $b$ are the SVM parameters and $\Phi(\cdot)$ is a nonlinear feature space mapping which transforms the original $d$-dimensional input space into some higher dimensional feature (Hilbert) space. From (3.1), it is clear that the SVM parameters $\mathbf{w}$ and $b$ define a linear hyperplane in the feature space corresponding to the nonlinear mapping $\Phi(\cdot)$. The optimal values of the SVM parameters $\mathbf{w}$ and $b$ are found by solving a convex optimization problem which maximizes the margin between the two classes.

3.1 Margins and Maximum Margin Classifiers

First, we discuss the concept of margins. Figure 3.1 shows the decision boundary between two classes. The parameter vector $\mathbf{w}$ is orthogonal to the decision hyperplane. Consider a training sample (data point) at A. The geometric margin (or simply margin) of the decision boundary with respect to this training sample A is the distance of the point A to the decision boundary. This margin is represented by the line segment AB in Figure 3.1. Mathematically, the margin of the decision boundary represented by the parameters $(\mathbf{w}, b)$ with respect to a training sample $(\mathbf{x}_i, y_i)$ is defined as

$$ \gamma_i = y_i \left( \left( \frac{\mathbf{w}}{\|\mathbf{w}\|} \right)^T \mathbf{x}_i + \frac{b}{\|\mathbf{w}\|} \right). \tag{3.2} $$
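As a quick numerical check of (3.2), the sketch below computes the geometric margin for a made-up hyperplane and sample, and confirms that rescaling $(\mathbf{w}, b)$ by a positive factor leaves the margin unchanged. The function name and the numbers are assumptions chosen for illustration.

```python
import math


def geometric_margin(w, b, x, y):
    """Geometric margin (3.2) of the hyperplane (w, b) w.r.t. the sample (x, y)."""
    norm = math.sqrt(sum(wk * wk for wk in w))
    return y * (sum(wk * xk for wk, xk in zip(w, x)) + b) / norm


w, b = [3.0, 4.0], -5.0     # made-up hyperplane with ||w|| = 5
x, y = [3.0, 4.0], +1       # made-up training sample

margin = geometric_margin(w, b, x, y)                            # (9 + 16 - 5) / 5
scaled = geometric_margin([10 * wk for wk in w], 10 * b, x, y)   # rescaled (w, b)
```

Both calls give a margin of 4, because the factor of 10 cancels between the numerator and $\|\mathbf{w}\|$.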

Using the notion of margins, the optimal classifier with good generalization is the one which maximizes the margin with respect to the entire training set $T$. Classifiers that separate the training samples with the maximum margin (large gap) are called maximum margin classifiers. The support vector machine (SVM) is one such maximum margin classifier. Given a training set $T = \{(\mathbf{x}_i, y_i);\ i = 1, \ldots, n\}$, the SVM tries to solve the following optimization problem:

$$ \begin{aligned} \max_{\mathbf{w}, b} \quad & \gamma \\ \text{s.t.} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \gamma, \quad i = 1, \ldots, n \\ & \|\mathbf{w}\| = 1. \end{aligned} \tag{3.3} $$

Figure 3.1: Geometric margins.

In the above optimization problem we are trying to maximize the margin γ, subject to the constraint that every training sample has the margin at least γ. The solution of the optimization problem (3.3) gives the hyperplane (w, b) with the largest possible margin with respect to the entire training set.

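The above optimization problem (3.3) cannot be solved directly, as the constraint $\|\mathbf{w}\| = 1$ is a nonconvex one. We can try to embed this constraint into the objective function, which results in the following optimization problem:

$$ \begin{aligned} \max_{\mathbf{w}, b} \quad & \frac{\hat{\gamma}}{\|\mathbf{w}\|} \\ \text{s.t.} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq \hat{\gamma}, \quad i = 1, \ldots, n \end{aligned} \tag{3.4} $$

Here $\hat{\gamma} = \gamma \|\mathbf{w}\|$ is the margin before normalization by $\|\mathbf{w}\|$.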

Solving this optimization problem (3.4) is also in general difficult, as the objective function in (3.4) is a nonconvex one. To transform this optimization problem (3.4) into a convex one, we make use of the fact that scaling the parameters $(\mathbf{w}, b)$ does not affect the geometric margin. This is clear from the equation (3.2). We scale the parameters $(\mathbf{w}, b)$ by some factor such that $\hat{\gamma} = 1$. Now the objective function of the optimization problem (3.4) becomes $\frac{1}{\|\mathbf{w}\|}$. Moreover, maximizing $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\|\mathbf{w}\|^2$. So we now have the following optimization problem:

$$ \begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n \end{aligned} \tag{3.5} $$

The scaling factor $\frac{1}{2}$ in the objective function is introduced for mathematical convenience. This optimization problem (3.5) is a convex one. More precisely, the optimization problem (3.5) is a quadratic programming (QP) problem, which tries to minimize a quadratic function subject to a set of linear inequality constraints.

In practice, the dual problem of the above problem is solved. The Lagrangian of the above problem (3.5) is

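$$ L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right), \tag{3.6} $$

where $\boldsymbol{\lambda} = \{\lambda_i \geq 0,\ i = 1, \ldots, n\}$ are the Lagrange multipliers. The Lagrange dual function (or simply dual function) is

$$ D(\boldsymbol{\lambda}) = \inf_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\lambda}) = \inf_{\mathbf{w}, b} \left( \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) \right). \tag{3.7} $$

The Lagrange dual function is the minimum value of the Lagrangian over the parameters $(\mathbf{w}, b)$. To find the dual function, we first set the derivatives of the Lagrangian $L(\mathbf{w}, b, \boldsymbol{\lambda})$ with respect to $\mathbf{w}$ and $b$ equal to zero:

$$ \nabla_{\mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\lambda}) = \mathbf{w} - \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i = 0, \tag{3.8a} $$

$$ \frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\lambda}) = -\sum_{i=1}^{n} \lambda_i y_i = 0. \tag{3.8b} $$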

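From (3.8a), we have

$$ \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i. \tag{3.9} $$

Plugging the value of $\mathbf{w}$ from (3.9) into the Lagrangian (3.6), we get

$$ L(b, \boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j (\mathbf{x}_i)^T \mathbf{x}_j - b \sum_{i=1}^{n} \lambda_i y_i. \tag{3.10} $$

From (3.8b), the last term on the RHS of the above equation is zero. Finally, the Lagrange dual function is

$$ D(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j (\mathbf{x}_i)^T \mathbf{x}_j. \tag{3.11} $$

The Lagrange dual function (3.11) gives lower bounds on the optimal value of the original primal problem (3.5). If $p^{\star}$ is the optimal value of the primal optimization problem (3.5), then for any $\boldsymbol{\lambda} \geq 0$ we have

$$ D(\boldsymbol{\lambda}) \leq p^{\star}. \tag{3.12} $$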
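The bound (3.12) can be checked numerically on a toy problem. The sketch below assumes a made-up data set of two points, for which the primal optimum of (3.5) is $\mathbf{w}^{\star} = (0.5, 0.5)$, $b^{\star} = 0$, and evaluates (3.11) on multipliers with $\lambda_1 = \lambda_2$, so that the condition (3.8b) used to derive (3.11) holds. The function names are illustrative.

```python
# Toy data set (an assumption for illustration): two points, one per class.
X = [(1.0, 1.0), (-1.0, -1.0)]
Y = [+1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_function(lam):
    """Lagrange dual function (3.11) evaluated at the multipliers lam."""
    n = len(X)
    quad = sum(Y[i] * Y[j] * lam[i] * lam[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(lam) - 0.5 * quad


# Primal optimum of (3.5) for this data: w* = (0.5, 0.5), b* = 0.
p_star = 0.5 * dot((0.5, 0.5), (0.5, 0.5))

# Evaluate D on a grid with lam_1 = lam_2, which keeps sum_i lam_i y_i = 0.
values = [dual_function([t / 100, t / 100]) for t in range(0, 101)]
best = max(values)
```

Every value on the grid stays at or below $p^{\star} = 0.25$, and the largest one, reached at $\lambda_1 = \lambda_2 = 0.25$, equals $p^{\star}$, so this toy problem has no gap between the two optimal values.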

The best lower bound on $p^{\star}$ that the dual function can provide is found by solving the following optimization problem, which is called the Lagrange dual problem (or simply dual problem):

$$ \begin{aligned} \max_{\boldsymbol{\lambda}} \quad & D(\boldsymbol{\lambda}) \\ \text{s.t.} \quad & \boldsymbol{\lambda} \geq 0. \end{aligned} \tag{3.13} $$

The dual problem of our original optimization problem (3.5) is

$$ \begin{aligned} \max_{\boldsymbol{\lambda}} \quad & D(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j (\mathbf{x}_i)^T \mathbf{x}_j \\ \text{s.t.} \quad & \lambda_i \geq 0, \quad i = 1, \ldots, n \\ & \sum_{i=1}^{n} \lambda_i y_i = 0. \end{aligned} \tag{3.14} $$

The above dual problem (3.14) is also a quadratic programming (QP) problem, which maximizes a concave quadratic function subject to a set of inequality and equality constraints. Let $d^{\star}$ denote the optimal value of the dual problem (3.14). By (3.12), we always have $d^{\star} \leq p^{\star}$. The nonnegative quantity $p^{\star} - d^{\star}$ is called the duality gap. In order to solve the dual problem (3.14) in place of the original primal problem (3.5), we need to have zero duality gap, i.e., $d^{\star} = p^{\star}$. In order to have zero duality gap, the primal optimal point $(\mathbf{w}^{\star}, b^{\star})$ and the dual optimal point $\boldsymbol{\lambda}^{\star}$ must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$ y_i(\mathbf{w}^{\star T} \mathbf{x}_i + b^{\star}) \geq 1, \quad i = 1, \ldots, n \tag{3.15a} $$

$$ \lambda_i^{\star} \geq 0, \quad i = 1, \ldots, n \tag{3.15b} $$

$$ \lambda_i^{\star} \left( 1 - y_i(\mathbf{w}^{\star T} \mathbf{x}_i + b^{\star}) \right) = 0, \quad i = 1, \ldots, n \tag{3.15c} $$

$$ \nabla_{(\mathbf{w}, b)} L(\mathbf{w}^{\star}, b^{\star}, \boldsymbol{\lambda}^{\star}) = 0. \tag{3.15d} $$

These KKT conditions hold in our case. The KKT condition (3.15c) is an important one, which is generally referred to as the KKT dual complementarity condition. It states that, for every data point $\mathbf{x}_i$, either the corresponding $\lambda_i^{\star} = 0$ or $y_i(\mathbf{w}^{\star T} \mathbf{x}_i + b^{\star}) = 1$. Thus we can solve the dual problem (3.14) instead of the original primal problem (3.5). The various algorithms used to solve the dual problem (3.14) are discussed later in this chapter.
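The KKT conditions can be verified mechanically for a candidate solution. The sketch below is a plain check written for illustration; the function name, the tolerance, and the three-point data set are assumptions. The third point lies well outside the margin, so complementarity forces its multiplier to zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kkt_satisfied(X, Y, w, b, lam, tol=1e-9):
    """Check the KKT conditions (3.15a)-(3.15d) for the hard-margin SVM."""
    margins = [y * (dot(w, x) + b) for x, y in zip(X, Y)]
    primal_ok = all(m >= 1 - tol for m in margins)                        # (3.15a)
    dual_ok = all(l >= -tol for l in lam)                                 # (3.15b)
    comp_ok = all(abs(l * (1 - m)) <= tol for l, m in zip(lam, margins))  # (3.15c)
    # (3.15d): stationarity, i.e. conditions (3.8a) and (3.8b).
    w_from_lam = [sum(l * y * x[k] for l, y, x in zip(lam, Y, X))
                  for k in range(len(w))]
    grad_w_ok = all(abs(wk - vk) <= tol for wk, vk in zip(w, w_from_lam))
    grad_b_ok = abs(sum(l * y for l, y in zip(lam, Y))) <= tol
    return primal_ok and dual_ok and comp_ok and grad_w_ok and grad_b_ok


# Made-up data: the third point is far from the boundary.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
Y = [+1, -1, +1]

optimal = kkt_satisfied(X, Y, (0.5, 0.5), 0.0, [0.25, 0.25, 0.0])
not_optimal = kkt_satisfied(X, Y, (0.5, 0.5), 0.0, [0.25, 0.25, 0.1])
```

The second call fails because a nonzero multiplier is attached to a point whose constraint is not active, which violates (3.15c).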

Once the dual problem (3.14) is solved, we get the dual optimal point $\boldsymbol{\lambda}^{\star}$. Having found $\boldsymbol{\lambda}^{\star}$, we can find the primal optimal point $\mathbf{w}^{\star}$ using (3.9). In order to classify new samples using the trained SVM $(\mathbf{w}^{\star}, b^{\star})$, we just need to evaluate the sign of the quantity $f(\mathbf{x}) = \mathbf{w}^{\star T} \mathbf{x} + b^{\star}$. Using (3.9), we have

$$ \begin{aligned} f(\mathbf{x}) &= \mathbf{w}^{\star T} \mathbf{x} + b^{\star} \\ &= \left( \sum_{i=1}^{n} \lambda_i^{\star} y_i \mathbf{x}_i \right)^T \mathbf{x} + b^{\star} \\ &= \sum_{i=1}^{n} \lambda_i^{\star} y_i (\mathbf{x}_i)^T \mathbf{x} + b^{\star} \\ &= \sum_{i=1}^{n} \lambda_i^{\star} y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^{\star}, \end{aligned} \tag{3.16} $$

where $\langle \mathbf{x}_i, \mathbf{x} \rangle$ represents the inner product between the points $\mathbf{x}_i$ and $\mathbf{x}$. The optimal value $b^{\star}$ in the above equation can be calculated directly using (3.9) and the KKT dual complementarity condition (3.15c). The value $b^{\star}$ therefore also depends only on the Lagrange multipliers $\{\lambda_i^{\star}\}$, the training data points $\{\mathbf{x}_i\}$, and the training data labels $\{y_i\}$. It is clear from (3.16) that once the dual optimal point $\boldsymbol{\lambda}^{\star}$ is found by solving the dual problem (3.14), new data points can be classified directly using (3.16). There is no need to explicitly calculate the primal optimal point $(\mathbf{w}^{\star}, b^{\star})$ for future predictions about new data points. From (3.16), it is clear that the quantity $f(\mathbf{x})$ depends on the new point $\mathbf{x}$ only through its inner products with the points in the training set. From the KKT dual complementarity condition (3.15c), only the training points that lie exactly on the margin, i.e., those with $y_i(\mathbf{w}^{\star T} \mathbf{x}_i + b^{\star}) = 1$, can have nonzero $\lambda_i^{\star}$, and these are typically only a few data points. Such points in the training set are called support vectors, and hence the name support vector machine (SVM). All the training data points except the support vectors have $\lambda_i^{\star} = 0$, and thus do not play any role in forming the decision boundary or in making predictions for new data points.
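The prediction rule (3.16) and the recovery of $b^{\star}$ from complementarity can be sketched as follows. The function names, the tolerance, and the data are assumptions for illustration; the multipliers are the known dual optimum of this made-up three-point problem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bias_from_support_vector(X, Y, lam, tol=1e-9):
    """Recover b* from (3.15c): a point with lambda_s > 0 satisfies
    y_s (w*^T x_s + b*) = 1, hence b* = y_s - sum_i lambda_i y_i <x_i, x_s>."""
    s = next(i for i, l in enumerate(lam) if l > tol)
    return Y[s] - sum(l * y * dot(x, X[s]) for l, y, x in zip(lam, Y, X))


def decision_value(x_new, X, Y, lam, b, tol=1e-9):
    """Evaluate f(x) in (3.16) using only the support vectors (lambda_i > 0)."""
    return sum(l * y * dot(x, x_new)
               for l, y, x in zip(lam, Y, X) if l > tol) + b


# Made-up data; the third point is not a support vector (lambda_3 = 0).
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
Y = [+1, -1, +1]
lam = [0.25, 0.25, 0.0]

b = bias_from_support_vector(X, Y, lam)
f_pos = decision_value((2.0, 0.0), X, Y, lam, b)     # positive side
f_neg = decision_value((-2.0, -1.0), X, Y, lam, b)   # negative side
```

Only the first two training points enter the sums, so the cost of a prediction scales with the number of support vectors rather than with the size of the training set.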

3.2 Nonlinear Feature Space Mapping and Kernel Trick

From the above discussion, it is clear that the support vector machine (SVM) is a maximum margin classifier which tries to separate linearly separable data. Most of the time in real-world situations, the data is not linearly separable in the original $d$-dimensional input space $\mathcal{I}$. However, the data may be linearly separable in some higher dimensional feature space (Hilbert space) $\mathcal{H}$. Hence, the original input space $\mathcal{I}$ is transformed into a higher dimensional feature space $\mathcal{H}$ through a general nonlinear feature mapping $\Phi(\cdot)$. The SVM is applied in this new higher dimensional space. In order to do this, we need to replace $\mathbf{x}$ everywhere in the SVM algorithm with $\Phi(\mathbf{x})$. Since the SVM algorithm can be expressed entirely using the inner product $\langle \mathbf{x}_p, \mathbf{x}_q \rangle$, we need to replace all such inner products with $\langle \Phi(\mathbf{x}_p), \Phi(\mathbf{x}_q) \rangle$. This new inner product can be explicitly defined using a kernel function. For a given nonlinear feature mapping $\Phi(\cdot)$, the corresponding kernel function is defined as

$$ K(\mathbf{x}_p, \mathbf{x}_q) = \langle \Phi(\mathbf{x}_p), \Phi(\mathbf{x}_q) \rangle = (\Phi(\mathbf{x}_p))^T \Phi(\mathbf{x}_q). \tag{3.17} $$

Thus, we can just replace the inner product $\langle \mathbf{x}_p, \mathbf{x}_q \rangle$ everywhere in the SVM algorithm by the corresponding kernel function $K(\mathbf{x}_p, \mathbf{x}_q)$.
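A small numerical example makes (3.17) concrete. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, a choice made here only for illustration, for which the explicit feature mapping is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. The function names and the two points are made up.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for 2-D inputs whose inner product is (x^T z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(xp, xq):
    """Degree-2 polynomial kernel, computed without ever forming phi."""
    return dot(xp, xq) ** 2


xp, xq = (1.0, 2.0), (3.0, -1.0)   # made-up points
lhs = kernel(xp, xq)               # (1*3 + 2*(-1))^2
rhs = dot(phi(xp), phi(xq))        # same quantity via the explicit mapping
```

The two values agree up to floating-point rounding, while the kernel evaluation needs only one inner product in the original 2-D space.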

Given the nonlinear feature mapping $\Phi(\cdot)$, the corresponding kernel $K(\cdot,\cdot)$ can be easily calculated. Sometimes, however, determining $\Phi(\cdot)$ is very difficult, especially in very high-dimensional cases. Fortunately, most of the time we can efficiently construct the kernel function $K(\cdot,\cdot)$ directly without having to find the feature mapping $\Phi(\cdot)$ explicitly. This is a huge advantage. However, we need to make sure that the constructed function is a valid kernel, i.e., that it corresponds to a scalar product in some (high-dimensional) feature space. One way to do this is to expand the chosen kernel function and identify the corresponding mapping $\Phi(\cdot)$. This may be difficult in some situations. A simpler way to check whether a function is a valid kernel is to use Mercer’s condition.

For a function $K(\cdot,\cdot)$ to be a valid kernel corresponding to some nonlinear feature mapping $\Phi(\cdot)$, it needs to satisfy Mercer's condition: $K : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is a valid kernel if, for any finite data set $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$, the corresponding kernel matrix $\mathbf{K}$, with entries $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, is symmetric positive semidefinite.

Thus, we can replace the inner product $\langle \mathbf{x}_p, \mathbf{x}_q \rangle$ in the SVM algorithm with the kernel function $K(\mathbf{x}_p, \mathbf{x}_q)$ corresponding to a higher dimensional feature space, where the data is linearly separable, and apply the SVM algorithm in the new feature space. This is generally referred to as the kernel trick, which can be used for any classifier learning algorithm that can be expressed entirely in terms of inner products between the data points.
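Mercer's condition can be spot-checked numerically on a given data set. The following is a minimal sketch, not a proof of validity: it builds the kernel matrix for a handful of illustrative points and tests positive semidefiniteness by attempting a Cholesky factorization of $\mathbf{K} + \epsilon \mathbf{I}$. The helper names, the tolerance, and the counterexample function `negative_sq_distance` are assumptions introduced for this example.

```python
import math

def gaussian_kernel(xp, xq, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xp, xq))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def negative_sq_distance(xp, xq):
    """A symmetric function that is NOT a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(xp, xq))

def kernel_matrix(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(M, tol=1e-12):
    n = len(M)
    return all(abs(M[i][j] - M[j][i]) <= tol
               for i in range(n) for j in range(n))

def is_psd(M, tol=1e-9):
    """Try a Cholesky factorization of M + tol*I.

    Success indicates that the smallest eigenvalue of M exceeds -tol.
    """
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] + (tol if i == j else 0.0)
            s -= sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
K_gauss = kernel_matrix(gaussian_kernel, points)
K_bad = kernel_matrix(negative_sq_distance, points)
gauss_valid = is_symmetric(K_gauss) and is_psd(K_gauss)
bad_valid = is_symmetric(K_bad) and is_psd(K_bad)
```

A passing check on one data set is only a necessary condition, since Mercer's condition must hold for every finite data set; a failing check, however, is enough to reject a candidate kernel.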

Some examples of kernels include:

• Linear kernel:

$$K(\mathbf{x}_p, \mathbf{x}_q) = \mathbf{x}_p^T \mathbf{x}_q. \tag{3.18}$$

This is the simplest kernel, corresponding to the feature mapping $\Phi(\mathbf{x}) = \mathbf{x}$.

• Polynomial kernel:

$$K(\mathbf{x}_p, \mathbf{x}_q) = \left(\mathbf{x}_p^T \mathbf{x}_q\right)^d, \tag{3.19}$$

where $d$ is the degree.

• Gaussian kernel:

$$K(\mathbf{x}_p, \mathbf{x}_q) = \exp\left(-\frac{\|\mathbf{x}_p - \mathbf{x}_q\|^2}{2\sigma^2}\right). \tag{3.20}$$

The Gaussian kernel (3.20) corresponds to an infinite dimensional feature mapping. It is also referred to as the radial basis function (RBF) kernel.

• Hyperbolic tangent kernel:

$$K(\mathbf{x}_p, \mathbf{x}_q) = \tanh\left(a\, \mathbf{x}_p^T \mathbf{x}_q + b\right), \tag{3.21}$$

for some $a > 0$ and $b < 0$.
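The four kernels (3.18)-(3.21) translate directly into code. The sketch below uses plain Python sequences as vectors; the function and parameter names are illustrative choices, not part of the thesis.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xp, xq):
    """Equation (3.18)."""
    return dot(xp, xq)

def polynomial_kernel(xp, xq, degree):
    """Equation (3.19)."""
    return dot(xp, xq) ** degree

def gaussian_kernel(xp, xq, sigma):
    """Equation (3.20)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xp, xq))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def tanh_kernel(xp, xq, a, b):
    """Equation (3.21), intended for a > 0 and b < 0."""
    return math.tanh(a * dot(xp, xq) + b)
```

For example, `polynomial_kernel([1, 2], [3, 4], 2)` squares the inner product 11, and `gaussian_kernel(x, x, sigma)` equals 1 for any point `x`.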

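New kernels can also be constructed from existing kernels using the following properties. Given valid kernels $K_1(\cdot,\cdot)$ and $K_2(\cdot,\cdot)$, the following kernels will also be valid:

• $K(\mathbf{x}_p, \mathbf{x}_q) = K_1(\mathbf{x}_p, \mathbf{x}_q) + K_2(\mathbf{x}_p, \mathbf{x}_q)$.

• $K(\mathbf{x}_p, \mathbf{x}_q) = K_1(\mathbf{x}_p, \mathbf{x}_q) K_2(\mathbf{x}_p, \mathbf{x}_q)$.

• $K(\mathbf{x}_p, \mathbf{x}_q) = c K_1(\mathbf{x}_p, \mathbf{x}_q)$, where $c > 0$ is a constant.

• $K(\mathbf{x}_p, \mathbf{x}_q) = \exp(K_1(\mathbf{x}_p, \mathbf{x}_q))$.

• $K(\mathbf{x}_p, \mathbf{x}_q) = f(K_1(\mathbf{x}_p, \mathbf{x}_q))$, where $f(\cdot)$ is a polynomial with nonnegative coefficients.

• $K(\mathbf{x}_p, \mathbf{x}_q) = f(\mathbf{x}_p) K_1(\mathbf{x}_p, \mathbf{x}_q) f(\mathbf{x}_q)$, where $f(\cdot)$ is any function.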

3.3 Regularization and Soft Margins

So far, we have assumed that the training data is linearly separable either in the original input space or in some higher dimensional feature space. In some real-world situations, however, this may not be the case: the data may not be linearly separable even in the higher dimensional spaces, because the underlying true class distributions that generate the data may overlap significantly. In such cases, trying to separate the training data exactly may lead to overfitting and hence poor generalization, so we need to allow the SVM to misclassify some of the training points. In addition, the data may sometimes contain extreme outliers. The original SVM algorithm, which tries to separate the training data exactly, is extremely sensitive to such outliers, as illustrated in Figure 3.2. In such cases as well, we need to allow the misclassification of some of the training points.

Figure 3.2: Sensitivity of the original SVM (with hard margin) to outliers.

In the case of linearly separable classes, the original SVM uses an error function that gives an infinite error when a training sample is misclassified and zero error when it is correctly classified, which is optimized with respect to the SVM parameters to maximize the margin [7]. We can now modify this error function so that the data points are allowed to be on the wrong side of the decision boundary, with a penalty that increases with the distance from the decision boundary. The original SVM optimization problem (3.5) can be reformulated using L1 regularization as

$$\begin{aligned} \min_{\mathbf{w},\, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, ..., n \\ & \xi_i \ge 0, \quad i = 1, ..., n \end{aligned} \tag{3.22}$$

where the quantities $\xi_i$, $i = 1, ..., n$ are called slack variables. They are defined as follows:

• For the data points that are correctly classified and lie on or outside the margin boundary on the correct side, $\xi_i = 0$.

• For the data points that are correctly classified but lie inside the margin, $0 < \xi_i < 1$.

• For the data points that lie exactly on the decision boundary, $\xi_i = 1$.

• For the data points that are misclassified (i.e., those which lie on the wrong side of the decision boundary), $\xi_i > 1$.

Thus, for every data point $\mathbf{x}_i$ that violates the margin ($\xi_i > 0$), and in particular for every misclassified point, the objective function in (3.22) is penalized by the quantity $C\xi_i$. Hence the parameter $C > 0$ controls the trade-off between the twin objectives of maximizing the margin and minimizing the number of data points that are allowed to be misclassified. The parameter $C$ can also be interpreted as analogous to a regularization parameter which controls the trade-off between minimizing the number of classification errors and controlling the model complexity. Thus the SVM is relaxed to include some misclassifications, which leads to a soft margin instead of a hard margin.
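For a fixed hyperplane $(\mathbf{w}, b)$, the smallest slack that satisfies the constraints in (3.22) is $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$. The sketch below uses this fact to evaluate the slack variables and the soft-margin objective; the function names and the toy data are assumptions made for illustration only.

```python
def slack(w, b, x, y):
    """Smallest feasible slack for sample (x, y) given (w, b)."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def describe(xi):
    """Map a slack value to the four cases listed in the text."""
    if xi == 0.0:
        return "on or outside the margin"
    if xi < 1.0:
        return "inside the margin, correctly classified"
    if xi == 1.0:
        return "on the decision boundary"
    return "misclassified"

def soft_margin_objective(w, b, X, Y, C):
    """Objective of (3.22) with each slack at its smallest feasible value."""
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))

# Toy 2-D data, all labeled +1, with the hyperplane x_1 = 0.
w, b = [1.0, 0.0], 0.0
X = [(2.0, 0.0), (0.5, 0.0), (0.0, 0.0), (-1.0, 0.0)]
Y = [1, 1, 1, 1]
slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, C=1.0)
```

The four toy points fall into the four slack categories in order, which makes the role of $C$ as a per-unit penalty on margin violations easy to see.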

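As before, the dual of problem (3.22) is solved in practice. The Lagrangian of problem (3.22) is

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right) - \sum_{i=1}^{n} \nu_i \xi_i, \tag{3.23}$$

where $\boldsymbol{\lambda} = \{\lambda_i \ge 0,\ i = 1, ..., n\}$ and $\boldsymbol{\nu} = \{\nu_i \ge 0,\ i = 1, ..., n\}$ are the Lagrange multipliers. The dual problem is given by

$$\begin{aligned} \max_{\boldsymbol{\lambda}} \quad & D(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j\, \mathbf{x}_i^T \mathbf{x}_j \\ \text{s.t.} \quad & 0 \le \lambda_i \le C, \quad i = 1, ..., n \\ & \sum_{i=1}^{n} \lambda_i y_i = 0. \end{aligned} \tag{3.24}$$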

The above dual problem (3.24) is similar to the earlier dual problem (3.14), with the exception of the constraint on the Lagrange multipliers $\boldsymbol{\lambda} = \{\lambda_i,\ i = 1, ..., n\}$, which are now bounded above by $C$. The kernel trick can be applied here as well: the inner product $\mathbf{x}_i^T \mathbf{x}_j$ in the dual problem is replaced by the kernel $K(\mathbf{x}_i, \mathbf{x}_j)$.
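The step from the Lagrangian (3.23) to the dual (3.24) can be filled in as follows. This is the standard derivation, written out here as a sketch using the symbols of (3.23); setting the partial derivatives of $L$ with respect to the primal variables to zero gives the stationarity conditions, and the upper bound on $\lambda_i$ follows from $\nu_i \ge 0$.

```latex
\begin{align*}
\frac{\partial L}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0, \\
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; C - \lambda_i - \nu_i = 0
  \;\Rightarrow\; \lambda_i = C - \nu_i \le C .
\end{align*}
% Substituting these back into (3.23) removes w, b, and the slack terms:
\begin{align*}
D(\boldsymbol{\lambda})
  = \sum_{i=1}^{n} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    y_i y_j \lambda_i \lambda_j \, \mathbf{x}_i^T \mathbf{x}_j ,
  \qquad 0 \le \lambda_i \le C .
\end{align*}
```

The slack variables and the multipliers $\nu_i$ disappear from the dual objective; their only trace is the box constraint $0 \le \lambda_i \le C$.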


3.4 Solving the SVM Dual Problem

We now describe the algorithms used to solve the SVM dual problem (3.24), which is a quadratic programming (QP) problem: it optimizes a quadratic function subject to a set of inequality and equality constraints. Although the support vector machine (SVM) is a sparse kernel algorithm which uses relatively few basis functions (as defined by the support vectors) to make predictions for new data, its training algorithm uses the entire training data. Thus, efficient algorithms are required for solving the SVM dual problem. The objective function in (3.24) is a concave quadratic function, which implies that any local optimal point is also a global optimal point. Directly solving the QP problem is nevertheless often infeasible, as the computational and memory requirements are prohibitively high. The kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ alone takes a huge amount of memory, which grows quadratically with the size of the training data $n$. When $n$ is large, which is often the case in real-world applications, the SVM runs out of memory. Thus, some practical approaches need to be used in solving the SVM dual QP problem.
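The memory argument can be made concrete with a short calculation. The sketch below assumes a dense kernel matrix stored in 8-byte double precision; the function name and the sample sizes are illustrative and not taken from the thesis.

```python
def kernel_matrix_gib(n, bytes_per_entry=8):
    """Memory in GiB for a dense n-by-n kernel matrix."""
    return n * n * bytes_per_entry / 2 ** 30

small = kernel_matrix_gib(1024)      # a toy training set
large = kernel_matrix_gib(100_000)   # a moderately large training set
```

Under these assumptions, a training set of 100,000 samples already needs roughly 75 GiB for the kernel matrix alone, before any QP solver workspace is counted.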

One such method is the chunking method [8]. In the chunking method, the actual QP problem is broken down into a series of smaller QP problems, which identify the non-zero Lagrange multipliers. The basic idea is to scale down the size of the kernel matrix by discarding all those elements which correspond to zero Lagrange multipliers, and finally to solve the scaled down QP problem. Though this method reduces the size of the problem, its computational complexity is still high for large training datasets, as it still involves several matrix operations and the computation of gradients of the dual function. Decomposition methods [9] also solve a series of smaller QP problems. The only difference is that each of these smaller QP problems is of a fixed size. These methods suffer from the same disadvantages, and are not suitable for large training datasets.

In general, these methods are found to have computational complexity $O(n^3)$, where $n$ is the size of the training dataset. These methods clearly do not scale well with the size of the training data and are not suitable for arbitrarily large training datasets. In the next chapter, some fast and efficient training algorithms are presented which can be applied in applications involving very large datasets.

3.5 Summary

The support vector machine (SVM) is a maximum margin classifier which tries to separate the data of two classes with the maximum possible margin. Since the SVM generates a linear hyperplane (decision boundary), in order to generalize to nonlinearly separable data, the kernel trick is employed to transform the input space to a high dimensional space, where the data becomes linearly separable. Thus theoretically, the support vector machine (SVM) has two main phases:

1. The $d$-dimensional original input space $I$ is transformed into a higher dimensional feature space (Hilbert space) $H$ through a general nonlinear mapping $\Phi(\cdot)$. Usually, the kernel trick is used, which does not require the explicit computation of the nonlinear mapping $\Phi(\cdot)$.

2. The separating hyperplane (decision boundary), with the maximum possible margin, is then constructed in the high dimensional feature space $H$. This maximum margin decision hyperplane is obtained by solving the SVM dual quadratic programming problem.

The support vector machine (SVM), in general, generates hard classification decisions for new inputs. That is, the SVM decides whether a new input belongs to one class or the other. The SVM does not generate probabilistic outputs. However, in some situations, we are more interested in probabilistic outputs than in hard decisions. To address this issue, Platt proposed a post-processing approach in which a logistic sigmoid is fitted to the outputs of an already trained support vector machine [10]. This approach is described in detail in Appendix A.


CHAPTER 4

FAST AND EFFICIENT SVM TRAINING APPROACHES

Support vector machines (SVMs) are generally considered to be the best off-the-shelf supervised classifiers. However, the main drawback of support vector machines is that the training procedure has a very high computational complexity. This high computational complexity often limits the real-time implementation of support vector machines, especially in applications involving very large datasets like cyber attack detection. In this chapter, we first discuss some of the existing approaches which try to reduce the SVM training complexity, and then present our proposed fast training approach which further reduces the SVM training complexity without significantly degrading the classification performance of the SVM.

4.1 Existing Fast SVM Training Approaches

In [11] and [12], greedy approximation of the kernel matrix and low-rank kernel representation are proposed, respectively. The problem with these approximation-based approaches is that they have a significant impact on the classification performance [13]. There are other approaches which try to solve the quadratic programming (QP) problem more efficiently. One such method is the chunking method [8], which was described in the previous chapter. As mentioned before, this method is found to have complexity $O(n^3)$ [14], where $n$ is the size of the training set, and hence is not very efficient for training SVMs on large datasets. Another popular approach is sequential minimal optimization (SMO) [14]. The SMO algorithm is basically a coordinate ascent algorithm, and is a simpler and more efficient algorithm for solving the QP problem. SMO takes the chunking concept to its extreme limit: it breaks the original QP problem into a series of the smallest possible QP subproblems. At every step, SMO finds the optimal values of two Lagrange multipliers and updates the SVM accordingly. Two is the minimum number, because the equality constraint in (3.24) does not allow a single multiplier to change on its own. The main advantage of SMO lies in the fact that solving for two Lagrange multipliers at each step is done analytically, thus avoiding matrix computations and standard QP calculations altogether. In general, the SMO method is found to have complexity $O(n^2)$ [14], where $n$ is the size of the training dataset. There are some other approaches which try to avoid the quadratic program in the SVM algorithm [15]. However, they still involve the kernel trick, which itself has high complexity, and their performance greatly depends on factors such as random sampling, selection of the hyperparameter values, and stopping criteria.
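The analytic two-multiplier step at the heart of SMO can be sketched as follows. This is a simplified illustration of the commonly described pair update, not the full algorithm of [14]: the heuristics for choosing the pair and the bias update are omitted, and all names (`smo_pair_update`, `E_i`, `E_j`, and so on) are assumptions. Here `E_k` denotes the current prediction error $f(\mathbf{x}_k) - y_k$ and `K_ij` the kernel value $K(\mathbf{x}_i, \mathbf{x}_j)$.

```python
def smo_pair_update(lam_i, lam_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    """Jointly optimize two Lagrange multipliers of the dual (3.24).

    Returns the updated pair (lam_i, lam_j). The update keeps
    y_i*lam_i + y_j*lam_j unchanged and both values inside [0, C].
    """
    # Feasible interval for lam_j implied by the box and equality constraints.
    if y_i != y_j:
        low, high = max(0.0, lam_j - lam_i), min(C, C + lam_j - lam_i)
    else:
        low, high = max(0.0, lam_i + lam_j - C), min(C, lam_i + lam_j)

    # Second derivative of the dual objective along the constraint line.
    eta = 2.0 * K_ij - K_ii - K_jj
    if low >= high or eta >= 0.0:
        return lam_i, lam_j  # no progress possible for this pair

    # Unconstrained optimum for lam_j, then clipped to [low, high].
    new_j = lam_j - y_j * (E_i - E_j) / eta
    new_j = min(high, max(low, new_j))

    # lam_i moves so that the equality constraint stays satisfied.
    new_i = lam_i + y_i * y_j * (lam_j - new_j)
    return new_i, new_j

# Two samples of opposite class, all multipliers initially zero (so f = 0).
pair = smo_pair_update(0.0, 0.0, +1, -1, -1.0, 1.0, 1.0, 1.0, 0.0, C=1.0)
clipped = smo_pair_update(0.0, 0.0, +1, -1, -1.0, 1.0, 1.0, 1.0, 0.0, C=0.5)
```

Each update involves only a few scalar operations and three kernel values, which is why SMO avoids the matrix computations required by general QP solvers.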

On the other hand, there are improved training approaches which do not require any type of approximation in the SVM algorithm, but rather focus on appropriate training data selection for SVMs [16]. In [16], a reduced SVM (RSVM) is proposed based on simple random sampling. Tong and Koller propose an active training approach where the SVM learner (classifier) sequentially selects the training samples based on a certain criterion [17]. Among the three methods proposed in [17], the MaxMin and Ratio methods are computationally expensive, as they train the SVM multiple times in each iteration. The Simple method, though relatively faster, is more unstable and performed poorly on the Newsgroup data. Its complexity greatly depends on the price of each query.

Since the decision boundary of an SVM depends only on a small subset of the training data (the support vectors), fast training can be achieved by identifying and selecting the training samples that are likely to be support vectors. This idea has been explored in [18], [19], [13], [20], [21], [22]. In [18], Abe and Inoue extract the samples close to the boundary using the Mahalanobis distance measure, which generally performs well when the data belonging to each class is clustered together and when there is minimal overlap between the data of different classes. In [19], Shin and Cho propose a method which selects training samples near the boundary using the neighborhood properties of samples. Their method is based on the $k$-nearest neighbor ($k$-NN) algorithm, and tends to be computationally expensive for large high-dimensional data. Though they used a faster selective $k$-NN spanning approach, the performance of their method greatly depends on $k$, whose value was randomly chosen in different situations. In [13], Li et al. propose a method based on edge detection whose performance depends on prefixed parameters including $k$ (in $k$-means clustering) and $m$ (the number of neighbors). The method proposed by Lyhyaoui et al. [20] also depends on some additional parameters that need to be set beforehand. In [21], the training sample selection is done using the $k$-means clustering technique. The performance of this method greatly depends on $k$, whose value is randomly chosen for different cases. Fuzzy clustering based training data selection is proposed in [22]. The proposed fuzzy clustering method works well when the exact number of clusters in the data is known beforehand.

Clearly, the performance of most of these methods depends on certain key parameters which need to be fixed beforehand. These methods generally perform well if the preset parameter values exactly capture the nature of the training data. However, determining these key parameters is generally a non-trivial task, especially in high-dimensional spaces.

4.2 A New Fast SVM Training Approach

We propose a new fast training method in which there is no need to preset any parameters. The basic idea is to reduce the size of the training data by retaining only the most informative samples that are likely to become support vectors. First, we detect the clusters present in the data. We use the nonparametric Bayesian approach for clustering, which does not require any preset parameters. After forming the clusters, we propose an efficient sample elimination approach based on logistic regression. The proposed sample elimination approach takes two important factors into account: the relative position of a sample within the cluster and the relative position of a sample with respect to the other class data. By doing this, the SVM training complexity is shown to be much reduced without significantly degrading the classification performance of the SVM.

In the proposed fast training approach, our goal is to scale down the training set by retaining only the most relevant (most informative) samples and removing the least relevant samples. The main issue here is to determine which samples are the most informative and which are not. It is known that the samples which are close to the decision boundary are more important to form the boundary than the ones which are far away. In other words, the samples which are close to the boundary tend to be more informative for SVMs. Eventually such samples tend to become the support vectors. We make use of this fact to design our fast training approach. First, we accurately detect the clusters present in the training data. In each cluster, the outward samples which are closer to the other class samples, tend to be more informative than the inward samples. We then detect the most informative samples in each cluster, and retain them in the training set. All the other (least informative) samples are safely eliminated from the training set.

4.2.1 Training Data Clustering

First, we need to accurately form multiple clusters for each class present in the training data. One way to do this is to use any existing clustering technique to form the clusters. However, this simple and heuristic approach may not perform well, because most of the existing clustering techniques tend to form spherical clusters only. The true clusters in the data need not always be spherical; they can have any arbitrary shape. We need to accurately model the true shapes of the clusters present in the data in order to accurately detect the samples which are well inside the clusters. In this thesis, a more sophisticated method for this task is proposed.

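We define a cluster, in a more general way, to be the convex hull of a set of data points, which is the set of all convex combinations of the points. In other words, it is the smallest convex set that contains all the points. Thus, the cluster of $p$ points $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p \in \mathbb{R}^d$ is defined as

$$\operatorname{conv}\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p\} = \{\lambda_1 \mathbf{x}_1 + \lambda_2 \mathbf{x}_2 + ... + \lambda_p \mathbf{x}_p \mid \lambda_i \ge 0,\ i = 1, ..., p,\ \lambda_1 + \lambda_2 + ... + \lambda_p = 1\}. \tag{4.1}$$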

This is shown in Figure 4.1. In order to determine the center of a cluster, we need to calculate the center of the convex hull. Since the convex hull of a set of points is a polyhedron defined by its vertices, finding the exact center of this polyhedron is in general difficult, especially in high-dimensional spaces. Hence, we approximate the cluster (convex hull) using the Löwner-John ellipsoid, which is the minimum volume ellipsoid containing the cluster. Ellipsoids have many nice properties. They are generally considered to be a universal geometric approximator of convex sets, as they have sufficient degrees of freedom. Furthermore, ellipsoidal approximation is invariant under affine coordinate transformations. Finding the Löwner-John ellipsoid can be formulated as a convex optimization problem. The minimum volume ellipsoid containing the finite set of points $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p\}$ is the same as the minimum volume ellipsoid containing the polyhedron $\operatorname{conv}\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p\}$. The Löwner-John ellipsoid approximating a cluster (convex hull) is shown in Figure 4.2.

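Figure 4.1: Cluster defined as the convex hull of the data points.

Generally, an ellipsoid in $\mathbb{R}^d$ is defined as

$$\mathcal{E}(\mathbf{c}, \mathbf{P}) = \{\mathbf{x} \in \mathbb{R}^d \mid (\mathbf{x} - \mathbf{c})^T \mathbf{P}^{-1} (\mathbf{x} - \mathbf{c}) \le 1\}, \tag{4.2}$$

where $\mathbf{c} \in \mathbb{R}^d$ is the center of the ellipsoid and $\mathbf{P} \in \mathbb{R}^{d \times d}$ is a positive definite matrix. Using the ellipsoidal (non-Euclidean) norm $\|\mathbf{v}\|_P = \sqrt{\mathbf{v}^T \mathbf{P}^{-1} \mathbf{v}}$, the above definition is equivalent to

$$\mathcal{E}(\mathbf{c}, \mathbf{P}) = \{\mathbf{x} \in \mathbb{R}^d \mid \|\mathbf{x} - \mathbf{c}\|_P \le 1\}, \tag{4.3}$$

which represents the ellipsoid as a unit ball about the center $\mathbf{c}$ in the corresponding ellipsoidal norm $\|\cdot\|_P$. The volume of the ellipsoid $\mathcal{E}(\mathbf{c}, \mathbf{P})$ is given by

$$\operatorname{vol}(\mathcal{E}(\mathbf{c}, \mathbf{P})) = \sqrt{\det \mathbf{P}}\; \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}, \tag{4.4}$$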

where $\Gamma(\cdot)$ is the gamma function. The volume of the ellipsoid depends on $\mathbf{P}$ mainly through its determinant. Using this definition of an ellipsoid, the Löwner-John ellipsoid containing the points $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p \in \mathbb{R}^d$ can be computed by solving the following optimization problem:

$$\begin{aligned} \text{minimize} \quad & \sqrt{\det \mathbf{P}} \\ \text{subject to} \quad & (\mathbf{x}_i - \mathbf{c})^T \mathbf{P}^{-1} (\mathbf{x}_i - \mathbf{c}) \le 1, \quad i = 1, ..., p \\ & \mathbf{P} \succ 0, \end{aligned} \tag{4.5}$$

where $\mathbf{c} \in \mathbb{R}^d$ and $\mathbf{P} \in \mathbb{R}^{d \times d}$ are the variables.

Figure 4.2: Löwner-John ellipsoid.

The above problem (4.5) is a nonconvex one. To convert it into a convex problem, we need to parametrize the ellipsoid as

$$\mathcal{E}(\mathbf{A}, \mathbf{b}) = \{\mathbf{x} \in \mathbb{R}^d \mid \|\mathbf{A}\mathbf{x} + \mathbf{b}\|_2 \le 1\}, \tag{4.6}$$

which is the inverse image of a Euclidean unit ball under an affine mapping [23]. Here $\mathbf{A} \in \mathbb{R}^{d \times d}$ is a positive definite matrix, and the volume of the ellipsoid is now proportional to the determinant of $\mathbf{A}^{-1}$. The new ellipsoid formulation (4.6) can always be transformed back into the original formulation (4.2) using the following change of variables:

$$\mathbf{c} = -\mathbf{A}^{-1}\mathbf{b}, \qquad \mathbf{P} = \mathbf{A}^{-2}. \tag{4.7}$$

Using this new definition of an ellipsoid, computing the Löwner-John ellipsoid containing the points $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p \in \mathbb{R}^d$ can be formulated as

$$\begin{aligned} \text{minimize} \quad & \log \det \mathbf{A}^{-1} \\ \text{subject to} \quad & \|\mathbf{A}\mathbf{x}_i + \mathbf{b}\|_2 \le 1, \quad i = 1, ..., p \end{aligned} \tag{4.8}$$

where $\mathbf{A} \in \mathbb{R}^{d \times d}$ and $\mathbf{b} \in \mathbb{R}^d$ are the variables. The above problem (4.8) is a convex optimization problem with the implicit constraint $\mathbf{A} \succ 0$. It can be efficiently solved by the available interior-point methods.
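The equivalence of the two ellipsoid parametrizations can be checked numerically. The sketch below is restricted to the 2-dimensional case with a symmetric positive definite $\mathbf{A}$, and it does not solve (4.8); it only applies the change of variables (4.7) and the volume formula (4.4) to a hand-picked ellipse. All helper names and the example values are assumptions made for illustration.

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec2(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def to_center_form(A, b):
    """Change of variables (4.7): c = -A^{-1} b, P = A^{-2}."""
    A_inv = inv2(A)
    c = [-t for t in matvec2(A_inv, b)]
    P = matmul2(A_inv, A_inv)
    return c, P

def affine_norm(A, b, x):
    """||A x + b||_2, the quantity constrained in (4.6)."""
    y = matvec2(A, x)
    return math.hypot(y[0] + b[0], y[1] + b[1])

def ellipsoidal_norm(c, P, x):
    """||x - c||_P = sqrt((x - c)^T P^{-1} (x - c)), as in (4.3)."""
    d = [x[0] - c[0], x[1] - c[1]]
    Pinv_d = matvec2(inv2(P), d)
    return math.sqrt(d[0] * Pinv_d[0] + d[1] * Pinv_d[1])

def volume_2d(P):
    """Formula (4.4) for d = 2, which reduces to the ellipse area."""
    det_P = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return math.sqrt(det_P) * math.pi / math.gamma(2.0)

# Ellipse centered at (1, 0) with semi-axes 2 and 1.
A, b = [[0.5, 0.0], [0.0, 1.0]], [-0.5, 0.0]
c, P = to_center_form(A, b)
inside, outside = (2.5, 0.0), (3.5, 0.0)
```

For this ellipse both norms agree on every test point, and the volume formula returns $2\pi$, the familiar area $\pi \cdot 2 \cdot 1$.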

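To find the clusters present in the data, we use the nonparametric Bayesian clustering method, which does not require any preset sensitive parameters such as the number of clusters. Let $\mathbf{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$, $\mathbf{x}_n \in \mathbb{R}^d$, be the training data of one class. First, we model this training data by a mixture of $K$ Gaussian components (clusters) given by

$$\begin{aligned} p(\mathbf{x}_n \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \mathbf{P}) &= \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1}), \\ p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \mathbf{P}) &= \prod_{n=1}^{N} p(\mathbf{x}_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1}), \end{aligned} \tag{4.9}$$

where $P_k = \Sigma_k^{-1}$ is the precision matrix and $\pi_k$ ($k = 1, ..., K$) are the mixing parameters, which must be positive and sum to 1.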

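We now introduce a latent random variable $z_n$ corresponding to each data point $\mathbf{x}_n$. The variable $z_n$ can take any of the $K$ discrete values $\{1, ..., K\}$ and thus indicates the Gaussian distribution (cluster) to which the data point $\mathbf{x}_n$ belongs. We reformulate the above mixture model (4.9) using these latent variables. The distribution over $z_n$ is given by

$$p(z_n = k) = \pi_k, \qquad p(z_n) = \prod_{k=1}^{K} \pi_k^{\mathbb{1}(z_n = k)}. \tag{4.10}$$

The conditional distribution of $\mathbf{x}_n$ given $z_n$ is given by

$$p(\mathbf{x}_n \mid z_n = k) = \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1}), \qquad p(\mathbf{x}_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1})^{\mathbb{1}(z_n = k)}. \tag{4.11}$$

Calculating the marginal distribution of $\mathbf{x}_n$ from (4.10) and (4.11) gives the original mixture model (4.9). Considering the whole dataset $\mathbf{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$, we have the latent variables $\mathbf{z} = \{z_1, ..., z_N\}$. We now have

$$p(\mathbf{z} \mid \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{\mathbb{1}(z_n = k)}, \tag{4.12a}$$

$$p(\mathbf{X} \mid \mathbf{z}, \boldsymbol{\mu}, \mathbf{P}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1})^{\mathbb{1}(z_n = k)}. \tag{4.12b}$$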

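For mathematical convenience, we place conjugate priors over the parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$ and $\mathbf{P}$:

$$\begin{aligned} p(\boldsymbol{\pi}) = p(\pi_1, ..., \pi_K) &\sim \text{Dirichlet}\left(\frac{\alpha_0}{K}, ..., \frac{\alpha_0}{K}\right), \\ p(P_k) &\sim \mathcal{W}(m_1, V), \\ p(\mu_k) &\sim \mathcal{N}(m_2, R^{-1}), \end{aligned} \tag{4.13}$$

where $\mathcal{W}(m_1, V)$ is the Wishart distribution and $\{\alpha_0, m_1, V, m_2, R\}$ are the hyperparameters.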

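The conditional posteriors for the parameters can be obtained from the priors in (4.13) and the likelihood in (4.9). The conditional posterior for the means is given by

$$p(\mu_k \mid \mathbf{X}, \mathbf{z}, P_k, m_2, R) \sim \mathcal{N}\left(\frac{\bar{\mathbf{x}}_k n_k P_k + m_2 R}{n_k P_k + R},\ \frac{1}{n_k P_k + R}\right), \tag{4.14}$$

where $n_k$ is the number of data points that belong to component (cluster) $k$ and $\bar{\mathbf{x}}_k$ is the mean of these points [24]. Similarly, the conditional posterior for the precisions is given by

$$p(P_k \mid \mathbf{X}, \mathbf{z}, \mu_k, m_1, V) \sim \mathcal{W}\left(m_1 + n_k,\ (m_1 + n_k)\left[m_1 V + \sum_{n \mid z_n = k} (\mathbf{x}_n - \mu_k)^T (\mathbf{x}_n - \mu_k)\right]^{-1}\right). \tag{4.15}$$

The mixing parameters $\pi_k$ ($k = 1, ..., K$) have the symmetric Dirichlet prior with parameter $\alpha_0/K$:

$$p(\boldsymbol{\pi}) = p(\pi_1, ..., \pi_K) \sim \text{Dirichlet}\left(\frac{\alpha_0}{K}, ..., \frac{\alpha_0}{K}\right) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0/K)^K} \prod_{k=1}^{K} \pi_k^{\alpha_0/K - 1}. \tag{4.16}$$

Given the mixing parameters $\pi_k$ ($k = 1, ..., K$), the latent variables $\mathbf{z}$ have the multinomial distribution:

$$p(\mathbf{z} \mid \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{\mathbb{1}(z_n = k)} = \prod_{k=1}^{K} \pi_k^{n_k}, \qquad n_k = \sum_{n=1}^{N} \mathbb{1}(z_n = k). \tag{4.17}$$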

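Integrating out the mixing parameters using the Dirichlet integral, we get the prior on $\mathbf{z}$ directly as

$$p(\mathbf{z}) = p(\mathbf{z} \mid \alpha_0) = \frac{\Gamma(\alpha_0)}{\Gamma(N + \alpha_0)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_0/K)}{\Gamma(\alpha_0/K)}. \tag{4.18}$$

Writing the conditional prior (suitable for Gibbs sampling), we have

$$P(z_n = k \mid \mathbf{z}_{-n}, \alpha_0) = \frac{n_{-n,k} + \alpha_0/K}{N - 1 + \alpha_0}, \tag{4.19}$$

where the subscript $-n$ denotes all indices except $n$, and $n_{-n,k}$ is the number of data points (excluding $\mathbf{x}_n$) associated with the component (cluster) $k$. As we do not know the number of clusters ($K$) beforehand, we take the limit $K \to \infty$ of the conditional prior in (4.19). This yields the following conditional prior:

$$P(z_n = k \mid \mathbf{z}_{-n}, \alpha_0) = \frac{n_{-n,k}}{N - 1 + \alpha_0}, \tag{4.20a}$$

$$P(z_n \ne z_m\ \forall m \in \{-n\} \mid \mathbf{z}_{-n}, \alpha_0) = \frac{\alpha_0}{N - 1 + \alpha_0}. \tag{4.20b}$$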
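The conditional prior (4.20a)-(4.20b) is simple to evaluate once the cluster counts are known. The sketch below assumes the counts $n_{-n,k}$ are already available as a list; the function name and the example numbers are hypothetical.

```python
def crp_conditional_prior(counts, alpha0):
    """Evaluate (4.20a) for each existing cluster and (4.20b) for a new one.

    counts[k] is n_{-n,k}, the number of points other than x_n in cluster k,
    so sum(counts) equals N - 1.
    """
    total = sum(counts) + alpha0          # N - 1 + alpha0
    existing = [c / total for c in counts]
    new_cluster = alpha0 / total
    return existing, new_cluster

# Four other points split 3/1 between two clusters, with alpha0 = 1.
existing, new_cluster = crp_conditional_prior([3, 1], alpha0=1.0)
```

The probabilities over all existing clusters plus the new cluster sum to one, and a larger $\alpha_0$ shifts mass toward opening a new cluster.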

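This conditional prior is often interpreted as the Chinese Restaurant Process (CRP). From Bayes' theorem, the posterior of the latent variables $\mathbf{z}$ is given by

$$p(\mathbf{z} \mid \mathbf{X}) \propto p(\mathbf{X} \mid \mathbf{z})\, p(\mathbf{z}). \tag{4.21}$$

From the conditional priors in (4.20a)-(4.20b) and the likelihood in (4.12b), we get the conditional posteriors for the latent variables $\mathbf{z}$:

$$P(z_n = k \mid \mathbf{z}_{-n}, \mathbf{x}_n, \alpha_0, \mu_k, P_k) \propto \frac{n_{-n,k}}{N - 1 + \alpha_0}\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1}), \tag{4.22a}$$

$$\begin{aligned} P(z_n \ne z_m\ \forall m \in \{-n\} \mid \mathbf{z}_{-n}, \mathbf{x}_n, \alpha_0, m_1, V, m_2, R) &\propto \frac{\alpha_0}{N - 1 + \alpha_0} \int p(\mathbf{x}_n \mid \mu_k, P_k)\, p(\mu_k, P_k \mid m_1, V, m_2, R)\, d\mu_k\, dP_k \\ &\propto \frac{\alpha_0}{N - 1 + \alpha_0} \int \mathcal{N}(\mathbf{x}_n \mid \mu_k, P_k^{-1})\, p(\mu_k \mid m_2, R)\, p(P_k \mid m_1, V)\, d\mu_k\, dP_k. \end{aligned} \tag{4.22b}$$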

We use a Gibbs sampling method [25] and repeatedly sample from the posterior (4.22a)-(4.22b):

• For all $k$, sample $\mu_k$ and $P_k$ according to the equations (4.14)-(4.15). Calculate the RHS of (4.22a).

• Draw 50 samples of $\mu_k$ and $P_k$ from their priors in (4.13) to approximately calculate the integral on the RHS of (4.22b), and then calculate the RHS of (4.22b).

• Draw a sample for $\mathbf{z}$ from the multinomial distribution with the event probabilities calculated above.

• Remove a component (cluster) when it becomes empty. A new component (cluster) is added when a new unrepresented component from (4.22b) is chosen.

• We initially start with a single cluster and preset hyperparameters.

• Stopping criterion: Stop when components are neither removed nor added for five consecutive iterations.


The convergence and the accuracy of the above sampling method depend mainly on the hyperparameter $\alpha_0$. If $\alpha_0$ is too low, we may get too few clusters. On the other hand, if it is too high, we may get too many clusters. We empirically set α0 = ₁₀₀N . The remaining hyperparameters are set according to the hyperpriors suggested in [24].$^1$ After convergence, the components obtained are considered to be the clusters present in the data. The Löwner-John ellipsoid for each cluster is then calculated by (4.8). This clustering process is done for the data of each class separately.

4.2.2 Training Data Elimination

With the above clustering algorithm, every cluster in the training data is accurately modeled by its Löwner-John ellipsoid. We now make use of the fact that the samples which are close to the decision boundary are more informative than the ones which are far away. The relative informativeness of a sample is determined by the following two factors:

1. Relative position of the sample in the cluster.

2. Relative position of the sample with respect to the other class data.

In other words, the samples which are well inside the clusters and far away from the other class data are considered to be the least informative and hence can be eliminated. All such samples in each cluster are eliminated from the training set, and only the most informative samples stay. The probability that a sample $\mathbf{x}_n$ stays in the training set is given by a logistic regression model:

$$P_{\text{stay}}(\mathbf{x}_n) = \frac{e^G}{1 + e^G}, \quad \text{where} \quad G = \alpha_1(2v_1 - 1) + \alpha_2(2v_2 - 1), \quad v_1 = \|\mathbf{x}_n - \mathbf{c}\|_P, \quad v_2 = \frac{\left\|\sum_{j=1}^{k_{\text{oth}}} \mathbf{z}_{nj}\right\|}{\|\mathbf{z}_{\max}\|}, \tag{4.23}$$

where $\alpha_1$ and $\alpha_2$ are the parameters of the logistic regression model, $\mathbf{c}$ and $\|\cdot\|_P$ are respectively the center and the ellipsoidal norm of the cluster to which the sample $\mathbf{x}_n$ belongs, $k_{\text{oth}}$ is the number of other class clusters, $\mathbf{z}_{nj}$ is the force vector which represents the closeness of the sample $\mathbf{x}_n$ to the other class cluster $j$, and $\mathbf{z}_{\max}$ is the maximum force vector over all the samples within the cluster to which the sample $\mathbf{x}_n$ belongs. The force vector between the sample $\mathbf{x}_n$ and the cluster $j$ is given by

$$\mathbf{z}_{nj} = \frac{\mathbf{c}_j - \mathbf{x}_n}{\|\mathbf{c}_j - \mathbf{x}_n\|_2^2}, \tag{4.24}$$

where $\mathbf{c}_j$ is the center of the cluster $j$. The numerator of $v_2$ gives the magnitude of the resultant force between the sample $\mathbf{x}_n$ and the other class cluster centers, and the denominator is a normalization factor. In fact, the independent variables $v_1$ and $v_2$ in the logistic model are measures of the two aforementioned factors. These variables vary in the interval $[0, 1]$, so a logistic model using them directly would not span the entire probability range $[0, 1]$. Hence we first apply the affine transformation $(\cdot) \mapsto 2(\cdot) - 1$ to the variables.

$^1$ The presetting of these parameters has minimal effect on the final SVM classification performance.

The logistic model parameters are empirically chosen to be $\alpha_1 = 10$ and $\alpha_2 = 20$. The values of these parameters govern the final probability of stay ($P_{\text{stay}}$) values. As we intend to make hard decisions about the sample selection, the $P_{\text{stay}}$ values need to be either 0 or 1. Hence, relatively high values were given to the parameters. The corresponding sigmoid curve is shown in Figure 4.3. The values of these parameters can be adjusted to make the sigmoid curve smoother, so that more probabilistic decisions can be made about the sample selection as required. Also, we gave relatively more importance to the second variable $v_2$, which measures the closeness of the sample to the other class data. We retain all the training samples whose probability of stay ($P_{\text{stay}}$) is equal to 1. All the other samples are eliminated from the training set. Finally, the support vector machine is trained using this new training set, which contains only the most informative training samples.
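The elimination model (4.23)-(4.24) can be sketched in a few lines. The code below assumes that $v_1$ has already been computed from the cluster's Löwner-John ellipsoid, and it interprets $\|\mathbf{z}_{\max}\|$ as the largest resultant-force magnitude among the samples of the same cluster; that interpretation, the function names, and the toy coordinates are assumptions made for illustration.

```python
import math

def force_vector(x, c_j):
    """Equation (4.24): (c_j - x) / ||c_j - x||_2^2."""
    diff = [cj - xi for cj, xi in zip(c_j, x)]
    sq = sum(d * d for d in diff)
    return [d / sq for d in diff]

def resultant_force_norm(x, other_centers):
    """Magnitude of the summed force vectors toward other-class centers."""
    total = [0.0] * len(x)
    for c_j in other_centers:
        total = [t + z for t, z in zip(total, force_vector(x, c_j))]
    return math.sqrt(sum(t * t for t in total))

def v2_for_cluster(points, other_centers):
    """Normalize each sample's resultant force by the cluster maximum."""
    norms = [resultant_force_norm(x, other_centers) for x in points]
    z_max = max(norms)
    return [r / z_max for r in norms]

def p_stay(v1, v2, alpha1=10.0, alpha2=20.0):
    """Equation (4.23), written as 1 / (1 + exp(-G))."""
    g = alpha1 * (2.0 * v1 - 1.0) + alpha2 * (2.0 * v2 - 1.0)
    return 1.0 / (1.0 + math.exp(-g))

# Toy cluster of two samples and one other-class center at (4, 0).
cluster = [(0.0, 0.0), (2.0, 0.0)]
v2 = v2_for_cluster(cluster, [(4.0, 0.0)])
```

With $\alpha_1 = 10$ and $\alpha_2 = 20$ the sigmoid is steep, so a sample with $v_1 = v_2 = 1$ gets a probability that is numerically indistinguishable from 1, while a sample deep inside its cluster and far from the other class gets a probability near 0.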

Figure 4.3: Logistic sigmoid curve of the training sample elimination model.

4.3 Complexity and Performance Analysis

Computing the Löwner-John ellipsoid is the key to the proposed training sample selection method. The Löwner-John ellipsoid for each cluster in the data is computed by solving the convex optimization problem (4.8). The problem (4.8) is a second-order cone programming (SOCP) problem which can be efficiently solved using the available interior-point methods. In this work, we have used the self-dual minimization method whose complexity is about O(√dp), where $p$ is the number of points and $d$ is the dimension [26]. However, the Löwner-John ellipsoid is usually determined by at most $(d^2 + 3d)/2$ points out of the $p$ points [27]. Thus, the complexity of computing the Löwner-John ellipsoid can be further reduced by intelligently selecting these points. For this purpose, we use the active-set strategy based on Sample Covariance Initialization (SCI) proposed in [28]:

1. Define an initial active set $X_a^0 = \{\mathbf{x}_1, ..., \mathbf{x}_L\}$ such that the affine hull of $\mathbf{x}_1, ..., \mathbf{x}_L$ spans the space $\mathbb{R}^d$. The Sample Covariance Initialization (SCI) scheme in [28] is used.

2. ($i$th iteration) Solve (4.8) for the active set $X_a^i$. Let $(\mathbf{c}^i, \mathbf{P}^i)$ be the solution.

3. If $d_n^i = \|\mathbf{x}_n - \mathbf{c}^i\|_{P^i} \le 1$ for all $n \in X \setminus X_a^i$, stop. Otherwise, proceed to the next step.

4. Update the active set to $X_a^{i+1}$:

• (Adding points to the active set) $X_a^{i+1} = X_a^i \cup \Delta X_a$. We intend to add the points $\mathbf{x}_n \notin X_a^i$ whose distance $d_n^i$ from the current center $\mathbf{c}^i$ in the ellipsoidal norm $\|\cdot\|_{P^i}$ is $\ge 1$ and largest. To further reduce the redundancy, we intend to add the points that are well spread around the current ellipsoid $\mathcal{E}(\mathbf{c}^i, \mathbf{P}^i)$. For this purpose, we gradually consider the points in the descending order of $d_n^i$, and add a point $\mathbf{x}_n$ to $\Delta X_a$ if

$$\sum_{j \in \Delta X_a} (\mathbf{x}_n - \mathbf{c}^i)^T (\mathbf{P}^i)^{-1} (\mathbf{x}_j - \mathbf{c}^i) < 0. \tag{4.25}$$

• (Removing the points that are no longer necessary) We delete all the points $\mathbf{x}_n$ whose $d_n^i < 0.99$. Go to step 2.

Using this active-set strategy, the complexity of computing the Löwner-John ellipsoid is greatly reduced.

For the purpose of analysis, the proposed fast training method is divided into the following four phases, where most of the computation happens:

1. Clustering the data.

2. Computing the Löwner-John ellipsoids for each cluster.

3. Training sample elimination.

4. Training the SVM using the new training set.

For the nonparametric Bayesian clustering approach, we first evaluate the complexity of each iteration. The total amount of computation involved in each iteration can be typically divided into the following parts:

• Sampling the mean vectors and precision matrices from normal and Wishart distributions respectively.

• Calculating the posterior probabilities for each sample.

• Sampling the latent variables z from the multinomial distribution.
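The last two parts of an iteration can be sketched in Python as follows. This is a simplified illustration only: it assumes one-dimensional Gaussian clusters, a fixed number of clusters, and given mixing weights, means, and precisions, none of which reflect the full nonparametric model. The function names are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean, precision):
    """Density of a 1-D Gaussian parameterized by its mean and precision."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (x - mean) ** 2)

def posterior_probabilities(x, weights, means, precisions):
    """Posterior probability of each cluster for a single sample x."""
    unnorm = [w * gaussian_pdf(x, mu, lam)
              for w, mu, lam in zip(weights, means, precisions)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def sample_assignments(samples, weights, means, precisions, rng):
    """Draw one latent cluster label per sample from its posterior."""
    z = []
    for x in samples:  # a single pass over the n samples
        post = posterior_probabilities(x, weights, means, precisions)
        z.append(rng.choices(range(len(post)), weights=post)[0])
    return z

rng = random.Random(0)
samples = [-0.1, 0.2, 9.8, 10.1]
z = sample_assignments(samples, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0], rng)
```

For a fixed number of clusters, both functions touch each sample once, which is the linear dependence on the number of samples discussed next.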

The complexity of sampling from the normal and Wishart distributions is on the order of $O(d)$, where $d$ is the dimension of the data. The complexity of calculating the posterior probabilities for $n$ samples is $O(n)$. The complexity of sampling the latent variables from the multinomial distribution is $O(n)$. Thus, the complexity of each iteration in the nonparametric Bayesian clustering is $O(n)$. Generally, the total number of iterations is much smaller than the size of the training data. Thus, the overall complexity of the nonparametric Bayesian clustering approach can be approximated as $O(n)$. After clustering, computing the Löwner-John ellipsoids for the clusters has the complexity


of stay for each sample, and hence has the complexity $O(n)$. After reducing the training set to $m$ ($m \ll n$) training samples, the SVM is trained using this new training set. The complexity of this phase is $O(m^2)$, as we use the SMO method.

Thus the total complexity of our SVM training algorithm is at most

$$O(n) + O(n) + O(n) + O(m^2), \qquad (4.26)$$

whereas the complexity of conventional SVM training (using the SMO method) is $O(n^2)$. With $m \ll n$, this clearly demonstrates the reduction in computational complexity achieved by the proposed fast training approach.
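As a rough numerical illustration of (4.26), the following Python sketch compares the two operation counts for hypothetical sizes n = 100,000 and m = 2,000. Constant factors are ignored, and the sizes are assumptions chosen only for illustration, not values from the experiments.

```python
def proposed_cost(n, m):
    """Operation count from (4.26): three O(n) phases plus O(m^2) SMO training."""
    return 3 * n + m ** 2

def conventional_cost(n):
    """Operation count of SMO training on the full training set: O(n^2)."""
    return n ** 2

n, m = 100_000, 2_000  # hypothetical training set sizes
speedup = conventional_cost(n) / proposed_cost(n, m)
print(f"proposed: {proposed_cost(n, m):,}  conventional: {conventional_cost(n):,}")
print(f"ratio: {speedup:.0f}x")
```

Under these assumed sizes, the quadratic term on the reduced set dominates the proposed cost, yet the total remains more than three orders of magnitude below the conventional count.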

Our main objective in designing the fast training approach is to decrease the computational complexity of SVM training without significantly degrading the classification performance of the SVM. The design addresses this by retaining only the most relevant samples, those that are likely to become support vectors, thus greatly reducing the size of the training set and the SVM training complexity.


CHAPTER 5

FUSION RULES BASED ON DEMPSTER-SHAFER THEORY

In this chapter, we present the data fusion rules which are used at each sensor and the fusion center of the proposed cyber attack detection system. The fusion rules are designed using the Dempster-Shafer theory of evidence. The Dempster-Shafer theory can be considered as a generalized version of the traditional probability theory. The Dempster-Shafer theory is a result of the work by Arthur P. Dempster [29] and Glenn Shafer [30].

The Dempster-Shafer theory is both a theory of evidence and a theory of probable reasoning. It exactly quantifies the amount of evidence available from a source. Also, the Dempster-Shafer theory effectively combines evidence from different sources and defines the degree of belief based on the total evidence available from different sources. Probability theory is the most widely used mathematical framework to represent uncertainty. Generally, uncertainty can be of two types:

1. Aleatoric Uncertainty (Objective or Stochastic Uncertainty): It is the uncertainty due to the random behavior of a system. This type of uncertainty can be described using the idea of chances in general.

2. Epistemic Uncertainty (Subjective Uncertainty): It is the uncertainty due to the lack of information or knowledge about a system. In other words, it is the uncertainty due to ignorance. Generally, it can be described using the idea of beliefs.


used to describe the aleatoric uncertainties. The probability theory, when applied in aleatoric situations, is generally referred to as the objective (or frequency) probability theory. Hence, the traditional probability theory can handle the aleatoric uncertainty very well. However, the traditional probability theory has been directly extended to characterize the epistemic uncertainty as well (through the Bayesian theory). In the Bayesian theory of probability, the probability is defined and interpreted as the degree of belief in a particular proposition (or hypothesis) on the basis of the available evidence. Thus, the probability $P(A)$ represents the degree of belief in the proposition $A$ based on the available evidence. And the probability distribution (probability density function or probability mass function) is used to represent the amount of available evidence.

The (Bayesian) probability theory cannot accurately characterize the epistemic uncertainty. The basic idea in the Bayesian theory, that degrees of belief always share the mathematical structure of objective probabilities (chances), is not true in general. Degrees of belief do not, in general, have the same mathematical properties as objective probabilities. However, the Bayesian theory explicitly requires that the degrees of belief (represented by the Bayesian probabilities) obey all the mathematical rules of the chances. It directly applies the three basic axioms of (objective) probability to the Bayesian probabilities (which are the degrees of belief in the Bayesian theory). This makes the Bayesian theory too restrictive and inflexible in modeling epistemic uncertainty (or ignorance). The mathematical rule that makes the Bayesian probability theory too restrictive is the third axiom of probability:

• Additivity axiom of probability: The probability of the union of mutually exclusive (disjoint) events must be equal to the sum of the probabilities of the individual events. Let $E_1, E_2, \ldots, E_N$ be mutually exclusive events of the sample space $\Omega$. Then, according to the additivity axiom of the probability theory,

$$P(E_1 \cup E_2 \cup \ldots \cup E_N) = P(E_1) + P(E_2) + \ldots + P(E_N) = \sum_{i=1}^{N} P(E_i). \qquad (5.1)$$
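Equation (5.1) can be checked numerically with a short Python sketch. The fair die and the particular events below are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die (illustrative choice).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """Probability of an event given as a set of die faces."""
    return sum((pmf[face] for face in event), Fraction(0))

E1, E2, E3 = {1, 2}, {3}, {5, 6}  # mutually exclusive events
union = E1 | E2 | E3
lhs = prob(union)                        # P(E1 ∪ E2 ∪ E3)
rhs = prob(E1) + prob(E2) + prob(E3)     # sum of individual probabilities

# Consequence of additivity: an event and its complement share the total mass 1.
E = {1, 2}
complement = set(pmf) - E
total = prob(E) + prob(complement)
```

A direct consequence of additivity is that the probabilities of an event and its complement must sum to one, so assigning a probability to a proposition also fixes the probability of its negation.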

This rule is not appropriate for modeling the epistemic uncertainty. Due to this assumption, the Bayesian probability theory cannot exactly represent the ignorance. For example, let $E \in \Omega$ represent a proposition and the complement $\bar{E} \in \Omega$