Support Vector Machines - Classification Methods


5.2.2 Classification Methods

5.2.2.3 Support Vector Machines

Support Vector Machine (SVM) or Kernel Machine [101] (termed Support Vector Network [102]) is a supervised learning method that constructs maximum margin hyperplane(s) to separate labelled data instances for a classification or a regression task.

Linear SVM. Consider a data set with two class labels, $c_1$ and $c_2$. Each data instance (feature vector) $X_t$ at session slot $t$ has a class label $y_t \in \{c_1, c_2\}$. The concept of SVM is to find a linear hyperplane (i.e. linear discriminant) that separates the data instances of class label $c_1$ from the data instances of class label $c_2$ while maximising the margin. The margin is defined as the distance from the hyperplane to the data instances closest to it on either side [101]. These instances are referred to as support vectors. In Figure 5.1, we present a two-class classification task, where the blue circles and the red squares represent the data instances of class label $c_1$ and class label $c_2$ respectively. Let the solid line represent the optimal hyperplane that

58 Chapter 5. Supervised Learning for Imbalanced Insider Threat Detection

FIGURE 5.1: Two-class SVM classification task. The blue circles and the red squares represent the data instances of class label $c_1$ and class label $c_2$ respectively. The solid line represents the optimal hyperplane that separates the data instances of the two classes, and the dashed lines represent the two margin hyperplanes that determine the margin. The shadowed blue circles and red squares represent the support vectors that locate the two margin hyperplanes.

separates the data instances of the two classes, and the dashed lines represent the two margin hyperplanes that determine the margin.

The optimal hyperplane is the set of instances $X$ satisfying Equation 5.3:

$$w \cdot X - w_0 = 0 \tag{5.3}$$

where $w$ is a weight vector that is normal to the hyperplane (i.e. its normal vector), and $w_0$ is a scalar threshold that determines the location of the hyperplane with respect to the origin $0$. Furthermore, the equations for the two margin hyperplanes are given in Equation 5.4 and Equation 5.5:

$$w \cdot X - w_0 = +1 \tag{5.4}$$

$$w \cdot X - w_0 = -1 \tag{5.5}$$

where the values $+1$ and $-1$ fix the scale of $w$ and $w_0$, so that the margin can be maximised for the best generalisation. This means that: if $y_t = c_1$ for an instance $X_t$, then $w \cdot X_t - w_0 \geq +1$ (blue circles); and if $y_t = c_2$ for an instance $X_t$, then $w \cdot X_t - w_0 \leq -1$ (red squares). As illustrated

5.2. Background 59

in Figure 5.1, the shadowed blue circles and red squares represent the support vectors that locate the two margin hyperplanes, which determine the margin. As mentioned above, the support vectors are the data instances closest to the optimal hyperplane on either side; the distance from the support vectors to the optimal hyperplane is $\frac{1}{\|w\|}$, so the total margin is $\frac{2}{\|w\|}$.
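The geometry described above can be checked numerically. The following is a minimal sketch in plain Python, not the thesis implementation: the hyperplane parameters and the two instances are hypothetical values chosen so that each instance lies exactly on one of the margin hyperplanes, and the function names are illustrative only.

```python
import math

def decision_value(w, w0, x):
    """Return w . x - w0, the left-hand side of Equations 5.3 to 5.5."""
    return sum(wi * xi for wi, xi in zip(w, x)) - w0

def classify(w, w0, x):
    """Assign c1 on the positive side of the optimal hyperplane, c2 otherwise."""
    return "c1" if decision_value(w, w0, x) >= 0 else "c2"

def margin_widths(w):
    """Return (distance from a margin hyperplane to the optimal one, total margin)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 1.0 / norm, 2.0 / norm

# Hypothetical hyperplane and instances, chosen only for illustration.
w, w0 = [3.0, 4.0], 2.0
support_c1 = [1.0, 0.0]    # lies on the margin hyperplane of Equation 5.4
support_c2 = [-1.0, 1.0]   # lies on the margin hyperplane of Equation 5.5

half_margin, total_margin = margin_widths(w)
print(decision_value(w, w0, support_c1), classify(w, w0, support_c1))
print(decision_value(w, w0, support_c2), classify(w, w0, support_c2))
print(half_margin, total_margin)
```

Here the norm of $w$ is 5, so each support vector sits at distance 1/5 from the optimal hyperplane and the total margin is 2/5.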

Polynomial SVM and Radial SVM. The data in a classification task may not be linearly separable; that is, there may exist no linear hyperplane (i.e. linear model) that separates the two classes. Instead of trying to fit a non-linear model, the feature vectors (data instances) in the original feature space are mapped to a higher-dimensional feature space using a non-linear basis function (kernel function). A linear hyperplane is then determined on the new (mapped) feature vectors. In other words, a linear model in the new feature space replaces a non-linear model in the original space [101]. A kernel function is defined as a measure of similarity over pairs of data instances $(X_t, X_{t'})$ in a data set. In the following, we introduce two types of kernel function: polynomial and radial (or Gaussian).

The polynomial kernel function is given in Equation 5.6:

$$K(X_t, X_{t'}) = \left( (X_{t'})^T \cdot X_t + 1 \right)^q \tag{5.6}$$

where $X_t$ and $X_{t'}$ are data instances at session slots $t$ and $t'$ respectively; $T$ designates the transpose; and $q$ designates the degree of the polynomial.
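To illustrate how a kernel stands in for an explicit mapping to a higher-dimensional space, the sketch below evaluates the polynomial kernel for $q = 2$ on two-dimensional instances and compares it with a dot product in a six-dimensional feature space. The feature map `phi_degree2` and the sample vectors are assumptions introduced for this illustration; they are not taken from the thesis.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(x, z, q):
    """Polynomial kernel: (z^T . x + 1)^q."""
    return (dot(z, x) + 1.0) ** q

def phi_degree2(x):
    """Explicit degree-2 feature map for a 2-D vector (illustrative assumption)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

# Hypothetical instances, chosen only for illustration.
x = [1.0, 2.0]
z = [3.0, -1.0]

kernel_value = polynomial_kernel(x, z, 2)
explicit_value = dot(phi_degree2(x), phi_degree2(z))
print(kernel_value, explicit_value)
```

Both values agree up to floating-point rounding, which is why the mapped feature vectors never need to be computed explicitly when training with a kernel.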

The radial kernel function is given in Equation 5.7:

$$K(X_t, X_{t'}) = \exp\left[ -\frac{\|X_t - X_{t'}\|^2}{2s^2} \right] \tag{5.7}$$

where $\exp$ designates the exponential function and $s$ designates a covariance parameter (i.e. spread of values). Note that the larger the value of $s$, the smoother the discriminant.
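The effect of the spread parameter can be seen in a small sketch of the radial kernel. The function name and the sample instances below are hypothetical; the code only evaluates the kernel formula for two values of $s$.

```python
import math

def radial_kernel(x, z, s):
    """Radial (Gaussian) kernel: exp(-||x - z||^2 / (2 s^2))."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * s * s))

# Hypothetical instances at Euclidean distance 5 from each other.
x = [0.0, 0.0]
z = [3.0, 4.0]

narrow = radial_kernel(x, z, 1.0)  # small s: similarity decays quickly
wide = radial_kernel(x, z, 5.0)    # large s: similarity decays slowly
print(narrow, wide)
```

With a larger $s$, distant instances still count as similar, so each instance influences a wider region and the resulting discriminant is smoother.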

Optimisation Methods for SVM. To train the SVM classifier, the classification task is defined as an optimisation problem whose aim is to maximise the margin, which is equivalent to minimising $\|w\|$ (see Cortes and Vapnik [102] for more details). Campbell [103]


provides a survey of commonly used optimisation methods to solve an SVM optimisation problem including: column generation methods for linear optimisation problems; conjugate gradient and primal-dual interior point methods for quadratic optimisation problems; chunking method; Sequential Minimal Optimisation (SMO) method; and the Lagrangian method.
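For reference, the optimisation problem for the linearly separable case can be written out as follows. This is a sketch of the standard hard-margin formulation using the sign convention of the hyperplane equations above; the numeric label $r_t$ is notation assumed here, not taken from the original text, with $r_t = +1$ if $y_t = c_1$ and $r_t = -1$ if $y_t = c_2$.

```latex
\begin{aligned}
\min_{w,\, w_0} \quad & \frac{1}{2}\,\|w\|^2 \\
\text{subject to} \quad & r_t \,\bigl( w \cdot X_t - w_0 \bigr) \;\geq\; 1
  \qquad \text{for every instance } X_t .
\end{aligned}
```

Because the total margin is $2 / \|w\|$, minimising $\frac{1}{2}\|w\|^2$ maximises the margin, and the single constraint covers both margin conditions: instances of $c_1$ satisfy $w \cdot X_t - w_0 \geq +1$ and instances of $c_2$ satisfy $w \cdot X_t - w_0 \leq -1$. The objective is quadratic and the constraints are linear, which is why quadratic optimisation methods apply.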

5.3 CD-AMOTRE: Class Decomposition with Artificial Minority Oversampling and Trapper Removal

Data imbalance weakens the performance of a supervised classifier in both the learning phase and the prediction phase. In the learning phase, it hinders the classifier from finding an optimal decision boundary that separates the majority class (normal region) from the minority class (anomalous instances). In the prediction phase, when the classifier predicts the class label of a new instance, it can make two kinds of error: classifying a minority instance as majority (i.e. a false negative), or classifying a majority instance as minority (i.e. a false positive). False negatives may be attributed to the location of the minority instance in a cluster of majority instances, or to the density of the minority instances within the cluster. False positives, on the other hand, may be caused by the similarity between the feature values of a majority instance and those of neighbouring minority instances.
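The two kinds of error can be counted directly from true and predicted labels. The sketch below uses hypothetical label names and a made-up data set of eight normal and two anomalous instances; it is only meant to show why accuracy alone can hide false negatives on imbalanced data.

```python
def count_errors(true_labels, predicted_labels, minority="anomalous"):
    """Return (false_negatives, false_positives) with respect to the minority class."""
    false_negatives = 0
    false_positives = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == minority and predicted != minority:
            false_negatives += 1   # minority instance classified as majority
        elif truth != minority and predicted == minority:
            false_positives += 1   # majority instance classified as minority
    return false_negatives, false_positives

def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical imbalanced data: 8 majority (normal), 2 minority (anomalous).
truth = ["normal"] * 8 + ["anomalous"] * 2

# A classifier that labels every instance as the majority class.
always_normal = ["normal"] * 10

print(accuracy(truth, always_normal), count_errors(truth, always_normal))
```

The trivial classifier scores 80% accuracy while missing every anomalous instance, so false negatives and false positives need to be tracked separately.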

We address the imbalanced data problem with a hybrid approach, namely CD-AMOTRE, which combines class decomposition with the AMOTRE oversampling technique. The ultimate aim of this approach is to detect any-behaviour-all-threats, i.e. threat hunting (based on observation (1) in Section 5.1), and to reduce the number of false alarms (based on observation (2) in Section 5.1). As stated before, false alarms remain a limitation of existing approaches to insider threat detection.

In the following, we present our hybrid approach CD-AMOTRE that comprises two approaches: class decomposition, and the proposed AMOTRE oversampling technique.


In document Opportunistic machine learning methods for effective insider threat detection (Page 78-82)